Multi-armed bandits
Stateless RL — ε-greedy, UCB, Thompson sampling for the exploration/exploitation tradeoff.
Depth levels
L0 · Intro · ~1h
Frames A/B testing as a bandit problem.
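To make the framing concrete, here is a minimal sketch (variant names and conversion rates are hypothetical) that casts a two-variant A/B test as a Bernoulli bandit: each visitor is one pull of an arm, and the uniform 50/50 traffic split of a classic A/B test is the pure-exploration policy.

```python
import random

# Hypothetical true conversion rates for two page variants (the arms).
TRUE_RATES = {"A": 0.10, "B": 0.12}

def pull(arm):
    """Simulate one visitor: 1 if they convert, else 0."""
    return 1 if random.random() < TRUE_RATES[arm] else 0

# Classic A/B test as a bandit: uniform assignment = pure exploration.
counts = {"A": 0, "B": 0}
wins = {"A": 0, "B": 0}
for _ in range(10_000):
    arm = random.choice(["A", "B"])
    counts[arm] += 1
    wins[arm] += pull(arm)

for arm in ("A", "B"):
    print(arm, "empirical rate:", round(wins[arm] / counts[arm], 4))
```

Every bandit algorithm in the levels below replaces the uniform assignment with a smarter rule that shifts traffic toward the better arm while it is still learning which arm that is.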
L1 · Basics · ~6h
Implements ε-greedy, decaying ε; derives regret bounds intuitively.
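A minimal sketch of ε-greedy with multiplicative ε decay on a Bernoulli bandit; the arm means, initial ε, and decay rate are illustrative assumptions, not prescribed values.

```python
import random

def eps_greedy(true_means, steps=10_000, eps0=1.0, decay=0.999):
    """ε-greedy with decaying ε on a Bernoulli bandit.

    true_means: per-arm success probabilities (unknown to the agent).
    """
    k = len(true_means)
    counts = [0] * k      # pulls per arm
    values = [0.0] * k    # running mean reward per arm
    eps, total = eps0, 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(k)                      # explore
        else:
            arm = max(range(k), key=lambda a: values[a])   # exploit
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: V += (r - V) / n
        values[arm] += (reward - values[arm]) / counts[arm]
        eps *= decay      # decaying ε: explore less as estimates improve
        total += reward
    return values, total

values, total = eps_greedy([0.2, 0.5, 0.7])
print("estimated means:", [round(v, 3) for v in values])
```

With fixed ε the agent keeps paying an exploration cost forever (linear regret); decaying ε is the simplest fix and motivates the smarter index rules at L2.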
L2 · Working · ~12h
Uses UCB1, Thompson sampling; applies contextual bandits (LinUCB).
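The two index rules at this level can be sketched side by side. The sketch below assumes Bernoulli rewards and uses Beta(1,1) posteriors for Thompson sampling; LinUCB, which adds a per-arm linear model over context features, is omitted for brevity.

```python
import math
import random

TRUE_MEANS = [0.2, 0.5, 0.7]  # hypothetical Bernoulli arms

def ucb1(steps=5_000):
    k = len(TRUE_MEANS)
    counts, values, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_a)
            arm = max(range(k), key=lambda a:
                      values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
    return total

def thompson(steps=5_000):
    k = len(TRUE_MEANS)
    succ, fail, total = [0] * k, [0] * k, 0.0
    for _ in range(steps):
        # Sample a plausible mean per arm from its Beta posterior,
        # then play the argmax of the samples.
        arm = max(range(k),
                  key=lambda a: random.betavariate(succ[a] + 1, fail[a] + 1))
        r = 1 if random.random() < TRUE_MEANS[arm] else 0
        succ[arm] += r
        fail[arm] += 1 - r
        total += r
    return total

print("UCB1 reward:", ucb1(), "| Thompson reward:", thompson())
```

UCB1 explores via an explicit optimism bonus that shrinks as an arm accumulates pulls; Thompson sampling gets the same effect implicitly through posterior uncertainty.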
L3 · Advanced · ~25h
Proves √T regret bounds; best-arm identification; adversarial bandits (Exp3).
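Exp3, which achieves O(√(KT log K)) expected regret even when rewards are chosen adversarially, fits in a few lines. A minimal sketch; the adversarial reward sequence and the exploration rate γ below are illustrative assumptions.

```python
import math
import random

def exp3(reward_fn, k, steps, gamma=0.1):
    """Exp3: exponential weights with importance-weighted reward
    estimates. Rewards are assumed to lie in [0, 1].

    reward_fn(t, arm) -> reward; may be chosen adversarially.
    """
    weights = [1.0] * k
    total = 0.0
    for t in range(steps):
        wsum = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / k for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        r = reward_fn(t, arm)
        total += r
        # Unbiased estimate: observed reward divided by play probability.
        est = r / probs[arm]
        weights[arm] *= math.exp(gamma * est / k)
    return total

# Hypothetical adversarial sequence: arm 0 pays early, arm 1 pays late.
def adversary(t, arm):
    return 1.0 if (arm == 0) == (t < 2_500) else 0.0

print("Exp3 reward:", exp3(adversary, k=2, steps=5_000))
```

The importance weighting (dividing by the play probability) is what keeps the reward estimates unbiased despite only observing the arm actually played; a long-horizon implementation would also renormalize the weights to avoid overflow.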
L4 · Research · ~50h
Contextual bandits with nonlinear features; non-stationary bandits.
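For the non-stationary case, one standard idea is discounted UCB (Garivier and Moulines): geometrically down-weight old observations so the estimates can track drifting arm means. A minimal sketch under an assumed drift pattern; the discount γ and bonus scale c are illustrative.

```python
import math
import random

def discounted_ucb(means_fn, k, steps, gamma=0.99, c=1.0):
    """Discounted-UCB sketch for non-stationary Bernoulli bandits.

    means_fn(t) -> list of the current per-arm success probabilities
    (hypothetical environment; unknown to the agent).
    """
    n = [0.0] * k   # discounted pull counts
    s = [0.0] * k   # discounted reward sums
    total = 0.0
    for t in range(steps):
        # Exponential forgetting: decay all statistics before acting.
        n = [gamma * x for x in n]
        s = [gamma * x for x in s]
        n_total = sum(n)

        def index(a):
            if n[a] < 1e-9:
                return float("inf")  # force initial exploration
            bonus = c * math.sqrt(math.log(max(n_total, math.e)) / n[a])
            return s[a] / n[a] + bonus

        arm = max(range(k), key=index)
        r = 1.0 if random.random() < means_fn(t)[arm] else 0.0
        n[arm] += 1.0
        s[arm] += r
        total += r
    return total

# Hypothetical drift: the best arm switches halfway through the run.
def drifting_means(t):
    return [0.7, 0.3] if t < 5_000 else [0.3, 0.7]

print("reward:", discounted_ucb(drifting_means, k=2, steps=10_000))
```

Because discounted counts never grow without bound, the confidence bonus never fully vanishes, so the agent keeps re-checking arms it believes are bad; that is exactly the behavior a stationary UCB1 lacks when the environment shifts.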