MountainAI

Multi-armed bandits

Stateless RL: ε-greedy, UCB, and Thompson sampling for the exploration/exploitation tradeoff.

Depth levels

L0 Intro (~1 h)

Frames A/B testing as a bandit problem.
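The A/B-testing framing can be made concrete with a tiny simulator: each variant is an arm that pays 1 (a conversion) with a fixed unknown probability. A minimal sketch; the class name and the 5%/7% conversion rates are illustrative, not from the course:

```python
import random

class BernoulliBandit:
    """Each arm i pays 1 with probability probs[i], else 0 --
    e.g. two page variants with unknown conversion rates."""
    def __init__(self, probs, seed=0):
        self.probs = probs
        self.rng = random.Random(seed)

    def pull(self, arm):
        return 1 if self.rng.random() < self.probs[arm] else 0

# Hypothetical A/B test: variant A converts at 5%, variant B at 7%.
bandit = BernoulliBandit([0.05, 0.07])
rewards = [bandit.pull(0) for _ in range(1000)]
```

The bandit view improves on a fixed-split A/B test by letting the allocation shift toward the better variant as evidence accumulates.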

L1 Basics (~6 h)

Implements ε-greedy, decaying ε; derives regret bounds intuitively.
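The L1 material fits in a few lines of Python. A sketch of ε-greedy with a decaying schedule, under the assumption of rewards in [0, 1]; the function name and the 1/t decay are illustrative choices:

```python
import random

def epsilon_greedy(pull, n_arms, horizon, eps0=1.0, seed=0):
    """ε-greedy with a 1/t decay: explore a uniformly random arm with
    probability eps0/t, otherwise play the arm with the best empirical
    mean so far."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if rng.random() < eps0 / t:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return means, counts
```

With a fixed ε the algorithm keeps paying a constant exploration cost forever (linear regret); the decay is what makes sublinear regret possible.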

L2 Working (~12 h)

Uses UCB1, Thompson sampling; applies contextual bandits (LinUCB).
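One of the L2 algorithms, Thompson sampling for Bernoulli rewards, is compact enough to sketch here: keep a Beta posterior per arm, sample once from each, and play the arm with the highest sample. The function name is mine; rewards are assumed to be 0 or 1:

```python
import random

def thompson_bernoulli(pull, n_arms, horizon, seed=0):
    """Thompson sampling: per-arm Beta(a, b) posterior over the success
    probability, starting from a uniform Beta(1, 1) prior."""
    rng = random.Random(seed)
    a = [1.0] * n_arms  # 1 + observed successes
    b = [1.0] * n_arms  # 1 + observed failures
    for _ in range(horizon):
        samples = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        r = pull(arm)  # assumed to return 0 or 1
        a[arm] += r
        b[arm] += 1 - r
    return a, b
```

Exploration happens automatically: an under-sampled arm has a wide posterior, so it occasionally produces the highest sample and gets played.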

L3 Advanced (~25 h)

Proves √T regret bounds; best-arm identification; adversarial bandits (Exp3).
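The adversarial algorithm named at L3, Exp3, can be sketched as exponential weights over arms mixed with uniform exploration, updated with an importance-weighted reward estimate for the arm actually played. A minimal version, assuming rewards in [0, 1]; the function name and the choice of gamma are illustrative:

```python
import math
import random

def exp3(pull, n_arms, horizon, gamma=0.1, seed=0):
    """Exp3: sample from a mixture of the weight distribution and the
    uniform distribution (exploration rate gamma), then boost the
    played arm's weight by its importance-weighted reward estimate."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = pull(arm)  # reward assumed in [0, 1]
        est = r / probs[arm]  # unbiased estimate of this arm's reward
        weights[arm] *= math.exp(gamma * est / n_arms)
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]
```

The importance weighting is the key trick: dividing by the play probability keeps the reward estimate unbiased even though only one arm's reward is observed per round. (For long horizons the weights should be normalized periodically to avoid overflow.)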

L4 Research (~50 h)

Contextual bandits with nonlinear features; non-stationary bandits.
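A standard baseline for the non-stationary setting is to replace the sample mean with a constant-step-size (recency-weighted) estimate, so old rewards decay and the agent can track drifting arms. A sketch under that assumption; the function name and the eps/alpha values are illustrative:

```python
import random

def ewma_greedy(pull, n_arms, horizon, eps=0.1, alpha=0.1, seed=0):
    """ε-greedy with a constant step size alpha: each estimate is an
    exponentially weighted average of recent rewards, so it follows
    arms whose payoff changes over time."""
    rng = random.Random(seed)
    q = [0.0] * n_arms
    for _ in range(horizon):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # keep exploring forever
        else:
            arm = max(range(n_arms), key=lambda a: q[a])
        r = pull(arm)
        q[arm] += alpha * (r - q[arm])  # recency-weighted update
    return q
```

Note that ε stays constant here: in a non-stationary problem the identity of the best arm can change, so exploration must never be annealed away.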

Resources