LLM inference
Serving large language models efficiently — batching, speculative decoding, throughput/latency tradeoffs.
Depth levels
L0 — Intro (~1h)
Understands that running an LLM requires GPU memory and that batching multiple requests improves throughput.
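The GPU-memory point above can be made concrete with back-of-envelope arithmetic. This sketch uses illustrative numbers (a 7B-parameter model in fp16 is an assumption, not a fixed fact about any particular model):

```python
# Back-of-envelope GPU memory needed just for model weights.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model in fp16 needs roughly 14 GB for weights alone;
# activations and the KV cache come on top of that.
print(weight_memory_gb(7e9))  # → 14.0
```
Batching helps because the weights are read once per forward pass regardless of batch size, so serving many requests together amortises that memory traffic.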
L1 — Basics (~8h)
Runs a model with vLLM or llama.cpp; understands TTFT, TPS, batch size, and tensor parallelism basics.
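TTFT and TPS fall straight out of per-token timestamps. A minimal sketch with synthetic timestamps (the trace values are made up, not from a real server):

```python
def ttft_and_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT = first-token time minus request start;
    TPS = tokens decoded after the first, divided by the decode duration."""
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return ttft, tps

# Synthetic trace: request at t=0, first token at t=0.25 s, then one token every 20 ms.
times = [0.25 + 0.02 * i for i in range(51)]
ttft, tps = ttft_and_tps(0.0, times)
print(round(ttft, 2), round(tps, 1))  # → 0.25 50.0
```
TTFT is dominated by the prefill phase, TPS by the decode phase, which is why the two are tuned with different knobs.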
L2 — Working (~20h)
Tunes serving parameters for throughput vs latency; applies continuous batching, quantisation, and KV cache strategies.
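KV-cache budgeting sits at the centre of the throughput-vs-latency tuning described above. A sketch under assumed Llama-7B-like dimensions (32 layers, 32 heads, head dim 128, fp16 cache); the formula itself is the standard one: 2 (K and V) × layers × heads × head_dim × bytes per element, per token:

```python
def kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    # K and V each store n_heads * head_dim values per layer.
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem

def max_cached_tokens(free_gpu_bytes: int, **dims) -> int:
    """How many tokens of KV cache fit in the memory left after weights."""
    return free_gpu_bytes // kv_bytes_per_token(**dims)

per_token = kv_bytes_per_token()             # 524288 bytes = 0.5 MiB per token
budget = 10 * 1024**3                        # assume 10 GiB left after weights
print(per_token, max_cached_tokens(budget))  # → 524288 20480
```
This is why quantising the KV cache (e.g. to 1 byte per element) or evicting finished sequences under continuous batching directly translates into more concurrent requests.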
L3 — Advanced (~40h)
Understands speculative decoding, disaggregated prefill/decode, tensor/pipeline parallelism at depth; profiles and optimises a serving stack.
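The speculative-decoding control flow can be sketched with toy models. This shows only the simplified greedy-verification variant (accept the longest prefix where draft and target agree, then take the target's correction), not the full rejection-sampling scheme; both "models" here are stand-in lambdas:

```python
from typing import Callable, List

def speculative_step(draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     context: List[int], k: int = 4) -> List[int]:
    """One step of greedy speculative decoding: the draft model proposes
    k tokens; the target verifies them (in a real system, in one parallel
    forward pass) and keeps the agreeing prefix plus its own next token."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        top = target(ctx)
        if top == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(top)          # target's correction ends the step
            break
    else:
        accepted.append(target(ctx))      # bonus token when all k are accepted
    return accepted

# Toy models: the target counts upward; the draft agrees except after token 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
print(speculative_step(draft, target, [0], k=4))  # → [1, 2, 3]
```
The payoff is that one target forward pass can emit several tokens, so decode throughput rises as long as the draft's acceptance rate stays high.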
L4 — Research (~80h)
Contributes to inference efficiency research, hardware-aware serving, or LLM compiler backends.
Resources
L0 — Intro
L1 — Basics
L2 — Working