nlp / transformers
Tokenization
BPE, WordPiece, SentencePiece — how text becomes tokens and why token choice matters.
Depth levels
L0 — Intro (~0 h)
Understands that language models process tokens, not raw characters; knows vocabulary size matters.
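The point that models consume token ids rather than raw characters can be shown with a toy greedy tokenizer (a minimal sketch; the vocabulary and example text are invented for illustration):

```python
# Toy illustration: a model sees token ids, not raw characters.
# This vocabulary is invented for the sketch; real vocabularies
# hold tens of thousands of entries, which is why size matters.
vocab = {"<unk>": 0, "token": 1, "ization": 2, " ": 3, "matters": 4}

def encode(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            ids.append(vocab["<unk>"])  # character not covered by the vocab
            i += 1
    return ids

print(encode("tokenization matters", vocab))
```

A 20-character string becomes just four ids here; with a different vocabulary the same text could become many more, which is the trade-off vocabulary size controls.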
L1 — Basics (~5 h)
Uses HuggingFace tokenizers; understands BPE algorithm; handles special tokens, padding, truncation.
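The BPE algorithm at this level can be sketched in plain Python (an educational sketch, not the HuggingFace implementation; the toy corpus and merge count are invented):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merges: repeatedly merge the most
    frequent adjacent pair across the corpus."""
    words = Counter(tuple(w) for w in corpus)  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

print(train_bpe(["low", "low", "lower", "newest", "newest", "widest"], 4))
```

Real tokenizers add special tokens, padding, and truncation on top of these learned merges; the merge loop itself is the core of the algorithm.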
L2 — Working (~12 h)
Trains custom tokenizers; understands vocabulary coverage, fertility, multilingual tokenisation.
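Fertility, mentioned at this level, is the average number of tokens per word; it is a standard way to compare vocabulary coverage across languages. A small helper (a hedged sketch; the stand-in tokenizer below is invented, a trained one would be plugged in instead):

```python
def fertility(tokenize, texts):
    """Average tokens produced per whitespace-separated word.
    Higher fertility usually means worse vocabulary coverage for
    that language: each word is split into more pieces."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        for w in words:
            total_tokens += len(tokenize(w))
    return total_tokens / total_words

# Stand-in tokenizer for the sketch: splits words into 3-char chunks.
toy_tokenize = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]

print(fertility(toy_tokenize, ["the quick brown fox"]))
```

Running the same function with the same tokenizer over corpora in different languages makes coverage gaps visible: languages under-represented in the training data typically show markedly higher fertility.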
L3 — Advanced (~20 h)
Analyses tokenisation's impact on arithmetic reasoning, code, and multilingual models; implements byte-level BPE.
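The key property of byte-level BPE can be sketched as follows (a minimal sketch showing only the byte-fallback idea, not GPT-2's printable-byte remapping; the merge and token id are invented for illustration):

```python
def to_byte_symbols(text):
    """Represent text as its UTF-8 bytes: a fixed base alphabet of
    256 symbols, so any string (any language, emoji, code) is
    encodable with no <unk> token."""
    return list(text.encode("utf-8"))

def apply_merges(symbols, merges):
    """Apply learned merges in order, as in standard BPE, but over
    byte sequences rather than characters."""
    for pair, new_sym in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# 'é' has no single-byte representation; UTF-8 encodes it as two
# bytes (0xC3, 0xA9), which a learned merge can collapse into one
# token id (256 is an invented id for this sketch).
base = to_byte_symbols("café")
print(base)    # three ASCII bytes plus the two bytes of 'é'
print(apply_merges(base, [((0xC3, 0xA9), 256)]))
```

Because the base alphabet is bytes, rare scripts are never unrepresentable, only expensively tokenised; this is also why byte-level models behave differently on arithmetic and code, where digit and identifier splits depend on which merges were learned.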
L4 — Research (~50 h)
Contributes to tokenisation-free models, character-level architectures, or byte-level language models.
Resources
L1 — Basics