MountainAI

Tokenization

BPE, WordPiece, SentencePiece — how text becomes tokens and why token choice matters.

Depth levels

L0: Intro (~0 h)

Understands that language models process tokens, not raw characters; knows that vocabulary size matters.

L1: Basics (~5 h)

Uses HuggingFace tokenizers; understands the BPE algorithm; handles special tokens, padding, and truncation.
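The core of the BPE algorithm mentioned above fits in a few lines: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. A minimal sketch in pure Python (the function name `bpe_merges` and the toy corpus are illustrative, not a library API):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Each word starts as a tuple of characters; Counter tracks word frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, freq in corpus.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for sym, freq in corpus.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = bpe_merges(["low", "low", "lower", "newest", "newest", "newest"], 4)
print(merges[0])  # → ('w', 'e'), the most frequent adjacent pair
```

Production tokenizers add word-boundary markers, tie-breaking rules, and a fast merge-rank lookup at encode time, but the training loop is this pair-count-and-merge iteration.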

L2: Working (~12 h)

Trains custom tokenizers; understands vocabulary coverage, fertility, and multilingual tokenization.
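Fertility, as commonly defined, is the average number of tokens a tokenizer emits per word; a higher value means the same text consumes more of the model's context window. A toy sketch (assumes whitespace-segmented text; the character-level "tokenizer" stands in for a real one):

```python
def fertility(tokenize, texts):
    """Average tokens emitted per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy character-level "tokenizer": every non-space character is a token.
char_tok = lambda t: [c for c in t if c != " "]
print(fertility(char_tok, ["new estimate", "low tide"]))  # → 4.5
```

In multilingual evaluation, fertility is usually compared per language: a vocabulary trained mostly on English will show markedly higher fertility on, say, Thai or Finnish, which is one reason vocabulary coverage matters.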

L3: Advanced (~20 h)

Analyzes tokenization's impact on arithmetic reasoning, code, and multilingual models; implements byte-level BPE.
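The starting point for byte-level BPE is that the base vocabulary is the 256 possible byte values rather than characters, so any Unicode string is representable without an `<unk>` token; merges are then learned on top of bytes. A minimal illustration (`byte_tokens` is a hypothetical helper, not a library function):

```python
def byte_tokens(text):
    # The base vocabulary is the 256 byte values, so every Unicode
    # string decomposes into known base symbols; no <unk> is needed.
    return list(text.encode("utf-8"))

# "é" is two UTF-8 bytes, so non-ASCII text starts with higher fertility
# until merges for those byte sequences are learned.
print(byte_tokens("héllo"))  # → [104, 195, 169, 108, 108, 111]
```

This is also why the same learned vocabulary handles code, emoji, and any language's script uniformly, at the cost of longer base sequences for non-Latin text.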

L4: Research (~50 h)

Contributes to tokenization-free models, character-level architectures, or byte-level language models.

Resources