NLP / LLM / Inference Systems
KV cache
Key-value caching in transformer inference — reducing redundant computation during autoregressive generation.
Depth levels
L0 — Intro (~0 h)
Knows that the KV cache avoids recomputing past tokens during generation; understands that it trades GPU memory for compute.
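A toy sketch of why caching helps: each decode step projects only the new token's key and value and reuses all earlier projections from the cache. Sizes and weights are made up, and real models do this per layer and per head with learned projections; this is illustrative only.

```python
import numpy as np

# Toy single-head attention decode step (illustrative sizes/weights).
d = 8                      # head dimension (made-up)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend over all past tokens without re-projecting them."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # project only the NEW token's key/value
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)   # (t, d) -- reused from cache, not recomputed
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V            # attention output for the current position

for _ in range(5):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))  # 5: one key projection per token, never redone
```

Without the cache, step t would re-project and re-attend over all t previous tokens, making generation quadratic in sequence length.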
L1 — Basics (~5 h)
Explains what keys and values are cached; calculates memory footprint for a given model and context length.
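The footprint calculation at this level reduces to a one-line formula: 2 tensors (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. The sketch below evaluates it for Llama-2-7B-shaped dimensions in fp16 (32 layers, 32 heads, head dim 128, taken from the published config; treat the helper name as illustrative):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#               * seq_len * batch * bytes_per_element
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-shaped example in fp16 (dtype_bytes=2)
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token")   # 0.50 MiB
print(f"{full_ctx / 2**30:.2f} GiB at 4096 ctx")  # 2.00 GiB
```

So a single 4096-token sequence already consumes ~2 GiB of accelerator memory on top of the weights, which is why cache management dominates serving-system design.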
L2 — Working (~15 h)
Applies paged attention (vLLM), prefix caching, and sliding-window cache; tunes batch size vs cache tradeoff.
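A minimal sketch of the paged-attention idea, assuming a made-up block size and free-list allocator rather than vLLM's actual API: a per-sequence block table maps logical token positions to fixed-size physical blocks, so cache memory is allocated on demand instead of reserved for the maximum context up front.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative; not vLLM's value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical-block free list
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical token position to (block id, offset)."""
        return self.tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):                # sequence finished: recycle blocks
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                           # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(len(cache.tables[0]), cache.physical_slot(0, 39))
```

Because fragmentation is bounded to less than one block per sequence, the batch-size vs. cache tradeoff becomes a simple block-budget question, and prefix caching falls out naturally by letting two sequences' block tables share physical blocks.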
L3 — Advanced (~25 h)
Implements custom caching policies; understands GQA/MQA impact on cache size; analyses decode throughput.
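A quick illustration of the GQA/MQA point: cache size scales with the number of KV heads, not query heads, because grouped query heads share one K/V pair. Shapes follow Llama-2-70B's published config (80 layers, 64 query heads, 8 KV heads, head dim 128); the helper name is an assumption for this sketch.

```python
# Per-token KV bytes depend only on kv_heads, never on query heads.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(80, 64, 128)   # as if all 64 heads stored K/V
gqa = kv_bytes_per_token(80, 8, 128)    # grouped-query: 8 shared KV heads
print(mha // gqa)  # 8x smaller cache -> proportionally more sequences per GiB
```

That 8x reduction translates directly into decode throughput at a fixed memory budget, since batch size during decode is usually cache-bound.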
L4 — Research (~60 h)
Contributes to KV compression, speculative decoding, or sub-quadratic attention with persistent state.
Resources
L2 — Working