LLM inference
Serving large language models efficiently — batching, speculative decoding, throughput/latency tradeoffs.
Depth levels
L0 — Intro (~1h)
Understands that running an LLM requires GPU memory and that batching multiple requests improves throughput.
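The GPU-memory point above can be made concrete with back-of-envelope arithmetic. This sketch uses illustrative numbers (a 7B-parameter model in fp16 is an assumption, not a fixed fact about any particular model):

```python
# Back-of-envelope GPU memory needed just for model weights.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model in fp16 needs roughly 14 GB for weights alone;
# activations and the KV cache come on top of that.
print(weight_memory_gb(7e9))  # → 14.0
```
Batching helps because the weights are read once per forward pass regardless of batch size, so serving many requests together amortises that memory traffic.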
L1 — Basics (~8h)
Runs a model with vLLM or llama.cpp; understands TTFT, TPS, batch size, and tensor parallelism basics.
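TTFT and TPS fall straight out of per-token timestamps. A minimal sketch with synthetic timestamps (the trace values are made up, not from a real server):

```python
def ttft_and_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT = first-token time minus request start;
    TPS = tokens decoded after the first, divided by the decode duration."""
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return ttft, tps

# Synthetic trace: request at t=0, first token at t=0.25 s, then one token every 20 ms.
times = [0.25 + 0.02 * i for i in range(51)]
ttft, tps = ttft_and_tps(0.0, times)
print(round(ttft, 2), round(tps, 1))  # → 0.25 50.0
```
TTFT is dominated by the prefill phase, TPS by the decode phase, which is why the two are tuned with different knobs.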
L2 — Working (~20h)
Tunes serving parameters for throughput vs latency; applies continuous batching, quantisation, and KV cache strategies.
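KV-cache budgeting sits at the centre of the throughput-vs-latency tuning described above. A sketch under assumed Llama-7B-like dimensions (32 layers, 32 heads, head dim 128, fp16 cache); the formula itself is the standard one: 2 (K and V) × layers × heads × head_dim × bytes per element, per token:

```python
def kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    # K and V each store n_heads * head_dim values per layer.
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem

def max_cached_tokens(free_gpu_bytes: int, **dims) -> int:
    """How many tokens of KV cache fit in the memory left after weights."""
    return free_gpu_bytes // kv_bytes_per_token(**dims)

per_token = kv_bytes_per_token()             # 524288 bytes = 0.5 MiB per token
budget = 10 * 1024**3                        # assume 10 GiB left after weights
print(per_token, max_cached_tokens(budget))  # → 524288 20480
```
This is why quantising the KV cache (e.g. to 1 byte per element) or evicting finished sequences under continuous batching directly translates into more concurrent requests.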
L3 — Advanced (~40h)
Understands speculative decoding, disaggregated prefill/decode, tensor/pipeline parallelism at depth; profiles and optimises a serving stack.
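The speculative-decoding control flow can be sketched with toy models. This shows only the simplified greedy-verification variant (accept the longest prefix where draft and target agree, then take the target's correction), not the full rejection-sampling scheme; both "models" here are stand-in lambdas:

```python
from typing import Callable, List

def speculative_step(draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     context: List[int], k: int = 4) -> List[int]:
    """One step of greedy speculative decoding: the draft model proposes
    k tokens; the target verifies them (in a real system, in one parallel
    forward pass) and keeps the agreeing prefix plus its own next token."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        top = target(ctx)
        if top == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(top)          # target's correction ends the step
            break
    else:
        accepted.append(target(ctx))      # bonus token when all k are accepted
    return accepted

# Toy models: the target counts upward; the draft agrees except after token 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
print(speculative_step(draft, target, [0], k=4))  # → [1, 2, 3]
```
The payoff is that one target forward pass can emit several tokens, so decode throughput rises as long as the draft's acceptance rate stays high.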
L4 — Research (~80h)
Contributes to inference efficiency research, hardware-aware serving, or LLM compiler backends.
Resources
L0 — Intro
L1 — Basics
L2 — Working