MountainAI

Policy gradient methods

REINFORCE, PPO, and actor-critic methods: optimising policies directly, with applications in RLHF and robotics.

Depth levels

L0 · Intro (~0h)

Knows that policy gradient methods directly optimise the policy rather than estimating value functions.
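That one-line idea has a standard precise form, the score-function (policy gradient) identity, written here with the usual symbols (τ a trajectory sampled from the policy π_θ, G_t the return from step t):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
```

Because the gradient acts on log π_θ rather than on the environment dynamics, it can be estimated from sampled trajectories alone.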

L1 · Basics (~10h)

Derives the REINFORCE gradient estimator; implements a basic actor-critic agent for CartPole.
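A minimal sketch of the REINFORCE estimator on a toy two-armed bandit (a hypothetical task chosen for brevity, not part of the course): a softmax policy over logits θ, updated with the score-function gradient times the sampled reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)          # policy logits, one per arm
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                  # sample an action from the policy
    # assumed reward model: arm 1 pays ~1.0, arm 0 pays ~0.2
    reward = rng.normal(1.0 if a == 1 else 0.2, 0.1)
    grad_log_pi = np.eye(2)[a] - pi          # ∇θ log πθ(a) for a softmax policy
    theta += lr * grad_log_pi * reward       # REINFORCE: score function × return

print(softmax(theta))  # probability mass concentrates on the better arm
```

The same update scales to CartPole once `reward` is replaced by the return-to-go of each episode step and the softmax is parameterised by a network.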

L2 · Working (~25h)

Implements PPO with GAE, entropy regularisation, and the clipped surrogate objective; applies it to continuous control.
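The two pieces named above can be sketched standalone in plain numpy (a hypothetical illustration, not a full training loop): the GAE recursion over one trajectory, and the PPO clipped surrogate loss on precomputed log-probabilities.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one trajectory."""
    values = np.append(values, last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # discounted sum
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (negated, so it can be minimised)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.3])
adv = gae(rewards, values, last_value=0.0)
print(adv)
```

Note the clip: when the new policy drifts far from the old one, the ratio is truncated to [1 − ε, 1 + ε], so the gradient incentive to move further vanishes.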

L3 · Advanced (~35h)

Understands trust-region methods (TRPO) and natural policy gradients; analyses variance-reduction techniques.
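A toy illustration of the natural-gradient idea behind TRPO (this is the preconditioning step only, under assumed numbers, not TRPO itself): for a two-action softmax policy, build the Fisher information matrix and solve for the direction F⁻¹g instead of following the vanilla gradient g.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.array([0.0, 0.5])
pi = softmax(theta)

# Fisher matrix F = E_{a~π}[∇logπ(a) ∇logπ(a)^T]; for a softmax policy
# ∇θ log π(a) = e_a − π, which sums to zero, so F is singular and needs damping
# (practical TRPO implementations damp it inside conjugate-gradient solves).
grads = np.eye(2) - pi                      # row a holds ∇θ log π(a)
F = sum(pi[a] * np.outer(grads[a], grads[a]) for a in range(2))
F += 1e-3 * np.eye(2)                       # damping term (assumed value)

g = grads[1] * 1.0                          # vanilla gradient: action 1, advantage 1
nat_g = np.linalg.solve(F, g)               # natural gradient direction F⁻¹ g
```

Unlike the vanilla gradient, the natural-gradient direction is invariant to how the policy is parameterised, which is what makes fixed-size steps in this direction behave like small steps in KL divergence.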

L4 · Research (~70h)

Contributes to research on multi-agent policy gradients, offline policy optimisation, or RLHF-scale training.

Resources