nlp · llm · alignment · rl
RLHF
Reinforcement Learning from Human Feedback — aligning LLMs with human preferences via reward models and PPO.
Depth levels
L0 · Intro · ~0h
Knows that RLHF is used to align models such as ChatGPT and Claude with human preferences, and that it involves reward models and RL.
L1 · Basics · ~8h
Understands the three phases: SFT -> reward model -> PPO; can read the InstructGPT paper and explain each step.
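The reward-model phase sits between SFT and PPO: it is trained on human preference pairs with a Bradley-Terry pairwise loss. A minimal sketch, assuming scalar reward scores for the chosen and rejected responses (function name is illustrative):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for one preference pair:
    -log sigmoid(r_chosen - r_rejected). Minimized when the reward
    model scores the human-preferred response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is small for a well-ordered pair, large for a mis-ordered one.
print(round(reward_model_loss(2.0, 0.0), 4))  # → 0.1269
print(round(reward_model_loss(0.0, 2.0), 4))  # → 2.1269
```

In practice the scores come from a shared LM backbone with a scalar head, and the loss is averaged over a batch of pairs.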
L2 · Working · ~25h
Implements a basic RLHF pipeline; understands reward hacking, the KL penalty, and preference-data collection.
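The KL penalty is what keeps PPO from reward-hacking its way far from the SFT model: the reward the policy optimizes is the reward-model score minus a scaled divergence from the reference. A minimal sketch using the common per-token log-prob-difference KL estimate (names and the beta value are illustrative assumptions):

```python
def penalized_reward(rm_score: float,
                     logprob_policy: float,
                     logprob_ref: float,
                     beta: float = 0.1) -> float:
    """Shaped reward used in PPO-based RLHF: the reward-model score
    minus a KL penalty that anchors the policy to the SFT reference."""
    kl_estimate = logprob_policy - logprob_ref  # per-token KL estimate
    return rm_score - beta * kl_estimate

# A policy that drifts from the reference pays for its RM score.
print(penalized_reward(1.0, -1.0, -3.0))  # → 0.8
```

Raising beta trades reward for closeness to the reference; too low a beta is a common cause of reward hacking and degenerate outputs.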
L3 · Advanced · ~40h
Compares RLHF vs DPO vs RLAIF; designs preference datasets; analyses reward model failure modes.
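For the RLHF-vs-DPO comparison: DPO drops the explicit reward model and PPO loop, optimizing a single classification-style loss over preference pairs. A minimal sketch on one pair, assuming sequence log-probabilities under the policy and a frozen reference model (names are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair: -log sigmoid of the
    beta-scaled difference in policy-vs-reference log-ratios
    between the chosen and rejected responses."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy equals the reference, the loss is log 2.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # → 0.6931
```

The implicit reward here is the beta-scaled log-ratio of policy to reference, which is why DPO is often described as RLHF with the reward model folded into the policy.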
L4 · Research · ~80h
Contributes to Constitutional AI, scalable oversight, or broader alignment research.