Multi-armed bandits
Stateless RL — ε-greedy, UCB, Thompson sampling for the exploration/exploitation tradeoff.
Depth levels
L0 · Intro · ~1h
Frames A/B testing as a bandit problem.
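To make the framing concrete, here is a minimal sketch (variant names and conversion rates are hypothetical) that casts a two-variant A/B test as a Bernoulli bandit: each visitor is one pull of an arm, and the uniform 50/50 traffic split of a classic A/B test is the pure-exploration policy.

```python
import random

# Hypothetical true conversion rates for two page variants (the arms).
TRUE_RATES = {"A": 0.10, "B": 0.12}

def pull(arm):
    """Simulate one visitor: 1 if they convert, else 0."""
    return 1 if random.random() < TRUE_RATES[arm] else 0

# Classic A/B test as a bandit: uniform assignment = pure exploration.
counts = {"A": 0, "B": 0}
wins = {"A": 0, "B": 0}
for _ in range(10_000):
    arm = random.choice(["A", "B"])
    counts[arm] += 1
    wins[arm] += pull(arm)

for arm in ("A", "B"):
    print(arm, "empirical rate:", round(wins[arm] / counts[arm], 4))
```

Every bandit algorithm in the levels below replaces the uniform assignment with a smarter rule that shifts traffic toward the better arm while it is still learning which arm that is.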
L1 · Basics · ~6h
Implements ε-greedy, decaying ε; derives regret bounds intuitively.
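A minimal sketch of ε-greedy with multiplicative ε decay on a Bernoulli bandit; the arm means, initial ε, and decay rate are illustrative assumptions, not prescribed values.

```python
import random

def eps_greedy(true_means, steps=10_000, eps0=1.0, decay=0.999):
    """ε-greedy with decaying ε on a Bernoulli bandit.

    true_means: per-arm success probabilities (unknown to the agent).
    """
    k = len(true_means)
    counts = [0] * k      # pulls per arm
    values = [0.0] * k    # running mean reward per arm
    eps, total = eps0, 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(k)                      # explore
        else:
            arm = max(range(k), key=lambda a: values[a])   # exploit
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: V += (r - V) / n
        values[arm] += (reward - values[arm]) / counts[arm]
        eps *= decay      # decaying ε: explore less as estimates improve
        total += reward
    return values, total

values, total = eps_greedy([0.2, 0.5, 0.7])
print("estimated means:", [round(v, 3) for v in values])
```

With fixed ε the agent keeps paying an exploration cost forever (linear regret); decaying ε is the simplest fix and motivates the smarter index rules at L2.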
L2 · Working · ~12h
Uses UCB1, Thompson sampling; applies contextual bandits (LinUCB).
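The two index rules at this level can be sketched side by side. The sketch below assumes Bernoulli rewards and uses Beta(1,1) posteriors for Thompson sampling; LinUCB, which adds a per-arm linear model over context features, is omitted for brevity.

```python
import math
import random

TRUE_MEANS = [0.2, 0.5, 0.7]  # hypothetical Bernoulli arms

def ucb1(steps=5_000):
    k = len(TRUE_MEANS)
    counts, values, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_a)
            arm = max(range(k), key=lambda a:
                      values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
    return total

def thompson(steps=5_000):
    k = len(TRUE_MEANS)
    succ, fail, total = [0] * k, [0] * k, 0.0
    for _ in range(steps):
        # Sample a plausible mean per arm from its Beta posterior,
        # then play the argmax of the samples.
        arm = max(range(k),
                  key=lambda a: random.betavariate(succ[a] + 1, fail[a] + 1))
        r = 1 if random.random() < TRUE_MEANS[arm] else 0
        succ[arm] += r
        fail[arm] += 1 - r
        total += r
    return total

print("UCB1 reward:", ucb1(), "| Thompson reward:", thompson())
```

UCB1 explores via an explicit optimism bonus that shrinks as an arm accumulates pulls; Thompson sampling gets the same effect implicitly through posterior uncertainty.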
L3 · Advanced · ~25h
Proves √T regret bounds; best-arm identification; adversarial bandits (Exp3).
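Exp3, which achieves O(√(KT log K)) expected regret even when rewards are chosen adversarially, fits in a few lines. A minimal sketch; the adversarial reward sequence and the exploration rate γ below are illustrative assumptions.

```python
import math
import random

def exp3(reward_fn, k, steps, gamma=0.1):
    """Exp3: exponential weights with importance-weighted reward
    estimates. Rewards are assumed to lie in [0, 1].

    reward_fn(t, arm) -> reward; may be chosen adversarially.
    """
    weights = [1.0] * k
    total = 0.0
    for t in range(steps):
        wsum = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / k for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        r = reward_fn(t, arm)
        total += r
        # Unbiased estimate: observed reward divided by play probability.
        est = r / probs[arm]
        weights[arm] *= math.exp(gamma * est / k)
    return total

# Hypothetical adversarial sequence: arm 0 pays early, arm 1 pays late.
def adversary(t, arm):
    return 1.0 if (arm == 0) == (t < 2_500) else 0.0

print("Exp3 reward:", exp3(adversary, k=2, steps=5_000))
```

The importance weighting (dividing by the play probability) is what keeps the reward estimates unbiased despite only observing the arm actually played; a long-horizon implementation would also renormalize the weights to avoid overflow.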
L4 · Research · ~50h
Contextual bandits with nonlinear features; non-stationary bandits.
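For the non-stationary case, one standard idea is discounted UCB (Garivier and Moulines): geometrically down-weight old observations so the estimates can track drifting arm means. A minimal sketch under an assumed drift pattern; the discount γ and bonus scale c are illustrative.

```python
import math
import random

def discounted_ucb(means_fn, k, steps, gamma=0.99, c=1.0):
    """Discounted-UCB sketch for non-stationary Bernoulli bandits.

    means_fn(t) -> list of the current per-arm success probabilities
    (hypothetical environment; unknown to the agent).
    """
    n = [0.0] * k   # discounted pull counts
    s = [0.0] * k   # discounted reward sums
    total = 0.0
    for t in range(steps):
        # Exponential forgetting: decay all statistics before acting.
        n = [gamma * x for x in n]
        s = [gamma * x for x in s]
        n_total = sum(n)

        def index(a):
            if n[a] < 1e-9:
                return float("inf")  # force initial exploration
            bonus = c * math.sqrt(math.log(max(n_total, math.e)) / n[a])
            return s[a] / n[a] + bonus

        arm = max(range(k), key=index)
        r = 1.0 if random.random() < means_fn(t)[arm] else 0.0
        n[arm] += 1.0
        s[arm] += r
        total += r
    return total

# Hypothetical drift: the best arm switches halfway through the run.
def drifting_means(t):
    return [0.7, 0.3] if t < 5_000 else [0.3, 0.7]

print("reward:", discounted_ucb(drifting_means, k=2, steps=10_000))
```

Because discounted counts never grow without bound, the confidence bonus never fully vanishes, so the agent keeps re-checking arms it believes are bad; that is exactly the behavior a stationary UCB1 lacks when the environment shifts.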