rl, alignment
Policy gradient methods
REINFORCE, PPO, and actor-critic — optimising policies directly for use in RLHF and robotics.
Depth levels
L0 Intro (~0 h)
Knows that policy gradient methods optimise the policy directly rather than deriving it from an estimated value function.
L1 Basics (~10 h)
Derives the REINFORCE gradient estimator; implements a basic actor-critic for CartPole.
L2 Working (~25 h)
Implements PPO with GAE, entropy regularisation, and the clipped surrogate objective; applies it to continuous control.
L3 Advanced (~35 h)
Understands trust-region methods (TRPO) and natural policy gradients; analyses variance reduction techniques.
L4 Research (~70 h)
Contributes to multi-agent policy gradient, offline policy optimisation, or RLHF-scale training.
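To make the L1 and L2 levels concrete, here is a minimal sketch of three quantities they name: the discounted reward-to-go that weights each log-probability gradient in REINFORCE, GAE advantages, and the PPO clipped surrogate. The function names and the defaults (`gamma=0.99`, `lam=0.95`, `clip_eps=0.2`) are illustrative assumptions, not taken from this page. The code uses plain Python lists with no autodiff; a real implementation would compute the same quantities on tensors in a framework such as JAX or PyTorch.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = r_t + gamma * G_{t+1} for one episode.

    In REINFORCE, G_t is the weight on grad log pi(a_t | s_t).
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalised advantage estimation over one trajectory segment.

    values[t] is the critic's estimate V(s_t); last_value bootstraps the
    state after the final step (assumed 0.0 when the episode terminated).
    """
    advantages, running = [], 0.0
    next_value = last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v       # one-step TD error
        running = delta + gamma * lam * running  # exponentially weighted sum
        advantages.append(running)
        next_value = v
    return advantages[::-1]


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximise)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective pessimistic: the policy
        # gains nothing by pushing the ratio past the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With `lam=1.0` and a critic that always predicts zero, `gae_advantages` reduces to `discounted_returns`. With `lam=0.0` it reduces to the one-step TD error. This is the bias-variance trade-off the L3 level refers to under variance reduction.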
Resources
L2 — Working