NLP / LLM / Inference Systems
KV cache
Key-value caching in transformer inference — reducing redundant computation during autoregressive generation.
Depth levels
L0 — Intro (~0 h)
Knows that the KV cache avoids recomputing past tokens during generation; understands that it trades GPU memory for compute.
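A toy sketch of why caching helps: each decode step projects only the new token's key and value and reuses all earlier projections from the cache. Sizes and weights are made up, and real models do this per layer and per head with learned projections; this is illustrative only.

```python
import numpy as np

# Toy single-head attention decode step (illustrative sizes/weights).
d = 8                      # head dimension (made-up)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend over all past tokens without re-projecting them."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # project only the NEW token's key/value
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)   # (t, d) -- reused from cache, not recomputed
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V            # attention output for the current position

for _ in range(5):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))  # 5: one key projection per token, never redone
```

Without the cache, step t would re-project and re-attend over all t previous tokens, making generation quadratic in sequence length.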
L1 — Basics (~5 h)
Explains what keys and values are cached; calculates memory footprint for a given model and context length.
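The footprint calculation at this level reduces to a one-line formula: 2 tensors (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. The sketch below evaluates it for Llama-2-7B-shaped dimensions in fp16 (32 layers, 32 heads, head dim 128, taken from the published config; treat the helper name as illustrative):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#               * seq_len * batch * bytes_per_element
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-shaped example in fp16 (dtype_bytes=2)
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token")   # 0.50 MiB
print(f"{full_ctx / 2**30:.2f} GiB at 4096 ctx")  # 2.00 GiB
```

So a single 4096-token sequence already consumes ~2 GiB of accelerator memory on top of the weights, which is why cache management dominates serving-system design.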
L2 — Working (~15 h)
Applies paged attention (vLLM), prefix caching, and sliding-window cache; tunes batch size vs cache tradeoff.
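A minimal sketch of the paged-attention idea, assuming a made-up block size and free-list allocator rather than vLLM's actual API: a per-sequence block table maps logical token positions to fixed-size physical blocks, so cache memory is allocated on demand instead of reserved for the maximum context up front.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative; not vLLM's value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical-block free list
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical token position to (block id, offset)."""
        return self.tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):                # sequence finished: recycle blocks
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                           # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(len(cache.tables[0]), cache.physical_slot(0, 39))
```

Because fragmentation is bounded to less than one block per sequence, the batch-size vs. cache tradeoff becomes a simple block-budget question, and prefix caching falls out naturally by letting two sequences' block tables share physical blocks.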
L3 — Advanced (~25 h)
Implements custom caching policies; understands GQA/MQA impact on cache size; analyses decode throughput.
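A quick illustration of the GQA/MQA point: cache size scales with the number of KV heads, not query heads, because grouped query heads share one K/V pair. Shapes follow Llama-2-70B's published config (80 layers, 64 query heads, 8 KV heads, head dim 128); the helper name is an assumption for this sketch.

```python
# Per-token KV bytes depend only on kv_heads, never on query heads.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(80, 64, 128)   # as if all 64 heads stored K/V
gqa = kv_bytes_per_token(80, 8, 128)    # grouped-query: 8 shared KV heads
print(mha // gqa)  # 8x smaller cache -> proportionally more sequences per GiB
```

That 8x reduction translates directly into decode throughput at a fixed memory budget, since batch size during decode is usually cache-bound.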
L4 — Research (~60 h)
Contributes to KV compression, speculative decoding, or sub-quadratic attention with persistent state.
Resources
L2 — Working