rl, policy, deep-rl, llm-alignment
Proximal policy optimization (PPO)
Clipped surrogate objective for stable on-policy updates — the workhorse behind RLHF and most modern policy learning.
Depth levels
L0 Intro ~2h
Knows that PPO stabilizes training by clipping the probability ratio between the new and old policy.
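For reference, the clipped surrogate objective in the standard notation of Schulman et al. (2017):

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
    \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}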
L1 Basics ~12h
Reads the PPO-clip objective; trains PPO on a Gym environment with stable-baselines3.
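A minimal training sketch at this level, assuming stable-baselines3 and Gymnasium are installed; CartPole-v1 is just an example environment:

    # Train PPO on a classic-control task with stable-baselines3.
    from stable_baselines3 import PPO

    # "MlpPolicy" = small fully-connected actor-critic; env created by name.
    model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=50_000)  # on-policy rollouts + clipped updates

    # Quick qualitative check of the trained policy.
    vec_env = model.get_env()
    obs = vec_env.reset()
    for _ in range(200):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = vec_env.step(action)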
L2 Working ~20h
Tunes ε, epochs, GAE λ; debugs KL blowups; implements PPO from scratch.
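A from-scratch sketch of the two pieces this level exercises, GAE and the clipped loss; the values shown (ε=0.2, λ=0.95) are common defaults, not prescriptions:

    import torch

    def gae(rewards, values, dones, gamma=0.99, lam=0.95):
        """Generalized advantage estimation over one rollout.
        values carries one extra bootstrap entry: len(values) == len(rewards) + 1."""
        adv = torch.zeros_like(rewards)
        last = 0.0
        for t in reversed(range(len(rewards))):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            last = delta + gamma * lam * nonterminal * last
            adv[t] = last
        return adv

    def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
        """Clipped surrogate: the ratio is exp(logπ_new − logπ_old)."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
        return -torch.min(unclipped, clipped).mean()  # minimize negative objective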
L3 Advanced ~30h
Understands the trust-region interpretation (TRPO); applies PPO against learned reward models in RLHF.
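The trust-region view this level refers to: TRPO maximizes the same surrogate under an explicit KL constraint, which PPO's clip term approximates heuristically:

    \max_\theta\ \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]
    \quad \text{s.t.} \quad
    \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta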
L4 Research ~60h
Contributes alternatives (DPO, IPO, KTO); theoretical analysis of policy-improvement bounds.
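For orientation on the alternatives named here, a sketch of the DPO loss (Rafailov et al., 2023) on per-sequence log-probabilities; argument names are illustrative:

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """DPO: logistic loss on the difference of policy-vs-reference log-ratios.
        Each argument is a tensor of summed log-probs over whole responses."""
        chosen_ratio = logp_chosen - ref_logp_chosen
        rejected_ratio = logp_rejected - ref_logp_rejected
        # Push the preferred response's log-ratio above the dispreferred one's.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()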