rl, alignment

Policy gradient methods

REINFORCE, PPO, and actor-critic — optimising policies directly for use in RLHF and robotics.

Depth levels

L0 Intro ~0 h

Knows that policy gradient methods optimise the policy directly rather than deriving it from an estimated value function.

L1 Basics ~10 h

Derives REINFORCE gradient estimator; implements a basic actor-critic for CartPole.
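
For reference, the estimator named at this level is commonly written in the score-function form below. The notation is an assumption introduced here, not taken from this page: \tau is a trajectory sampled from the policy \pi_\theta, G_t is the discounted return from step t, and b(s_t) is an optional baseline.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      \bigl(G_t - b(s_t)\bigr)
    \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k
```

Setting b to zero gives plain REINFORCE; replacing b with a learned value estimate is the usual first step towards an actor-critic.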

L2 Working ~25 h

Implements PPO with GAE, entropy regularisation, and clip objective; applies to continuous control.
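
As a rough illustration of the three ingredients named at this level, the sketch below computes GAE advantages, the clipped surrogate loss, and a categorical entropy term in plain Python. The function names, argument layout, and default hyperparameters (gamma=0.99, lam=0.95, clip_eps=0.2) are assumptions chosen for illustration; a real implementation would use an autodiff framework and batched tensors.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals with weight gamma * lam
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped terms
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def categorical_entropy(probs):
    """Entropy of a discrete action distribution, used as an exploration bonus."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    adv = gae_advantages([1.0, 1.0, 1.0], [0.5, 0.4, 0.3, 0.0])
    loss = ppo_clip_loss([-0.6, -0.9, -1.1], [-0.7, -0.7, -0.7], adv)
    bonus = 0.01 * categorical_entropy([0.7, 0.3])  # 0.01 is an assumed coefficient
    print(adv, loss - bonus)
```

With gamma and lam both set to 1 and all value estimates at zero, `gae_advantages` reduces to the undiscounted reward-to-go, which is a convenient sanity check.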

L3 Advanced ~35 h

Understands trust-region methods (TRPO) and natural policy gradients; analyses variance reduction techniques.

L4 Research ~70 h

Contributes to multi-agent policy gradient, offline policy optimisation, or RLHF-scale training.

Resources

L1 — Basics

▶
David Silver RL Course — Policy Gradient
Silver, David (en, ~1 h)

L2 — Working

L3 — Advanced

📄
Trust Region Policy Optimization
Schulman, John et al. (en, ~3 h)

Leads to

Prerequisites