nlp · llm · alignment · rl
RLHF
Reinforcement Learning from Human Feedback — aligning LLMs with human preferences via reward models and PPO.
Depth levels
L0 · Intro · ~0h
Knows that RLHF is used to align models such as ChatGPT and Claude with human preferences, and that it involves reward models and RL.
L1 · Basics · ~8h
Understands the three phases: SFT -> reward model -> PPO; can read the InstructGPT paper and explain each step.
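The reward-model phase sits between SFT and PPO: it is trained on human preference pairs with a Bradley-Terry pairwise loss. A minimal sketch, assuming scalar reward scores for the chosen and rejected responses (function name is illustrative):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for one preference pair:
    -log sigmoid(r_chosen - r_rejected). Minimized when the reward
    model scores the human-preferred response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is small for a well-ordered pair, large for a mis-ordered one.
print(round(reward_model_loss(2.0, 0.0), 4))  # → 0.1269
print(round(reward_model_loss(0.0, 2.0), 4))  # → 2.1269
```

In practice the scores come from a shared LM backbone with a scalar head, and the loss is averaged over a batch of pairs.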
L2 · Working · ~25h
Implements a basic RLHF pipeline; understands reward hacking, the KL penalty, and preference-data collection.
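The KL penalty is what keeps PPO from reward-hacking its way far from the SFT model: the reward the policy optimizes is the reward-model score minus a scaled divergence from the reference. A minimal sketch using the common per-token log-prob-difference KL estimate (names and the beta value are illustrative assumptions):

```python
def penalized_reward(rm_score: float,
                     logprob_policy: float,
                     logprob_ref: float,
                     beta: float = 0.1) -> float:
    """Shaped reward used in PPO-based RLHF: the reward-model score
    minus a KL penalty that anchors the policy to the SFT reference."""
    kl_estimate = logprob_policy - logprob_ref  # per-token KL estimate
    return rm_score - beta * kl_estimate

# A policy that drifts from the reference pays for its RM score.
print(penalized_reward(1.0, -1.0, -3.0))  # → 0.8
```

Raising beta trades reward for closeness to the reference; too low a beta is a common cause of reward hacking and degenerate outputs.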
L3 · Advanced · ~40h
Compares RLHF vs DPO vs RLAIF; designs preference datasets; analyses reward model failure modes.
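For the RLHF-vs-DPO comparison: DPO drops the explicit reward model and PPO loop, optimizing a single classification-style loss over preference pairs. A minimal sketch on one pair, assuming sequence log-probabilities under the policy and a frozen reference model (names are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair: -log sigmoid of the
    beta-scaled difference in policy-vs-reference log-ratios
    between the chosen and rejected responses."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy equals the reference, the loss is log 2.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # → 0.6931
```

The implicit reward here is the beta-scaled log-ratio of policy to reference, which is why DPO is often described as RLHF with the reward model folded into the policy.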
L4 · Research · ~80h
Contributes to Constitutional AI, scalable oversight, or broader alignment research.