rl, policy, deep-rl, llm-alignment
Proximal policy optimization (PPO)
Clipped surrogate objective for stable on-policy updates — the workhorse behind RLHF and most modern policy learning.
Depth levels
L0 Intro ~2h
Knows that PPO stabilizes training by clipping the probability ratio between the new and old policy.
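For reference, the clipped surrogate objective in the standard notation of Schulman et al. (2017):

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
    \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}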
L1 Basics ~12h
Reads the PPO-clip objective; trains PPO on a Gym environment with stable-baselines3.
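A minimal training sketch at this level, assuming stable-baselines3 and Gymnasium are installed; CartPole-v1 is just an example environment:

    # Train PPO on a classic-control task with stable-baselines3.
    from stable_baselines3 import PPO

    # "MlpPolicy" = small fully-connected actor-critic; env created by name.
    model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=50_000)  # on-policy rollouts + clipped updates

    # Quick qualitative check of the trained policy.
    vec_env = model.get_env()
    obs = vec_env.reset()
    for _ in range(200):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = vec_env.step(action)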
L2 Working ~20h
Tunes ε, epochs, GAE λ; debugs KL blowups; implements PPO from scratch.
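A from-scratch sketch of the two pieces this level exercises, GAE and the clipped loss; the values shown (ε=0.2, λ=0.95) are common defaults, not prescriptions:

    import torch

    def gae(rewards, values, dones, gamma=0.99, lam=0.95):
        """Generalized advantage estimation over one rollout.
        values carries one extra bootstrap entry: len(values) == len(rewards) + 1."""
        adv = torch.zeros_like(rewards)
        last = 0.0
        for t in reversed(range(len(rewards))):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            last = delta + gamma * lam * nonterminal * last
            adv[t] = last
        return adv

    def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
        """Clipped surrogate: the ratio is exp(logπ_new − logπ_old)."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
        return -torch.min(unclipped, clipped).mean()  # minimize negative objective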
L3 Advanced ~30h
Understands the trust-region interpretation (TRPO); applies PPO against learned reward models in RLHF.
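The trust-region view this level refers to: TRPO maximizes the same surrogate under an explicit KL constraint, which PPO's clip term approximates heuristically:

    \max_\theta\ \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]
    \quad \text{s.t.} \quad
    \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta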
L4 Research ~60h
Contributes alternatives (DPO, IPO, KTO); theoretical analysis of policy-improvement bounds.
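For orientation on the alternatives named here, a sketch of the DPO loss (Rafailov et al., 2023) on per-sequence log-probabilities; argument names are illustrative:

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """DPO: logistic loss on the difference of policy-vs-reference log-ratios.
        Each argument is a tensor of summed log-probs over whole responses."""
        chosen_ratio = logp_chosen - ref_logp_chosen
        rejected_ratio = logp_rejected - ref_logp_rejected
        # Push the preferred response's log-ratio above the dispreferred one's.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()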