Inference & Serving
Ship the model: make it fast, cheap, and production-ready.
5 lessons
Lessons
- 01 Quantization Basics
  Why lower precision is an (almost) free lunch. (Medium)
- 02 INT8 & INT4 Quantization
  Post-training quantization and QAT, in detail. (Hard)
- 03 Speculative Decoding
  A small model drafts, the big model verifies: 2-3× faster. (Hard)
- 04 Continuous Batching
  The throughput trick that makes production LLMs economical. (Hard)
- 05 Paged Attention
  vLLM's virtual-memory-inspired KV cache. (Hard)
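A taste of Lesson 01: the "(almost) free lunch" of quantization comes down to mapping floats onto a small integer range with a scale factor, losing only a bounded rounding error. A minimal sketch with made-up weights, assuming symmetric per-tensor INT8 quantization:

```python
# Minimal symmetric INT8 quantization sketch. The weights are hypothetical;
# real systems quantize per-channel or per-group, but the round-trip idea
# is the same.

def quantize_int8(weights):
    """Map floats to int8 values with a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [x * scale for x in q]

weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# Worst-case round-trip error is about half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

Note the small weight 0.003 rounds all the way to 0: outliers stretch the scale and crush small values, which is exactly why the lunch is only *almost* free and why Lesson 02's finer-grained schemes exist.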
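The draft-then-verify loop described in Lesson 03 can be sketched with toy greedy models. `draft_model` and `target_model` are hypothetical stand-ins, each mapping a token sequence to its predicted next token; the speedup in a real system comes from the target scoring all k drafted positions in one parallel pass.

```python
# Toy speculative decoding sketch with deterministic (greedy) toy models.
# Model names and the cycle-based toy data are illustrative, not any
# library's API.

def speculative_decode(target_model, draft_model, prompt, k=4, steps=8):
    """Draft k tokens cheaply, keep the longest prefix the target agrees with."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + steps:
        # 1) Draft: the small model proposes k tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the target checks each drafted position in order and
        #    accepts the longest matching prefix (parallel in real systems).
        accepted, ctx = 0, list(tokens)
        for t in draft:
            if target_model(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens.extend(draft[:accepted])
        # 3) Emit one token from the target itself, so every iteration
        #    makes progress even when the first drafted token is rejected.
        tokens.append(target_model(tokens))
    return tokens[len(prompt):]

# Toy models: the target repeats a fixed cycle; the draft usually agrees.
cycle = [1, 2, 3, 4]
target = lambda ctx: cycle[len(ctx) % 4]
draft = lambda ctx: cycle[len(ctx) % 4] if len(ctx) % 7 else 0  # sometimes wrong
out = speculative_decode(target, draft, prompt=[0], k=4, steps=8)
```

When the draft agrees, one verify pass yields up to k+1 tokens instead of 1, which is where the 2-3× figure comes from; a wrong draft costs only the rejected suffix.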
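Lesson 04's throughput trick is scheduling at the iteration level: rather than waiting for an entire batch to drain, the server retires finished requests and admits waiting ones at every decode step. A toy scheduler sketch, with made-up request lengths:

```python
# Toy continuous (iteration-level) batching sketch. Request ids and token
# counts are illustrative; a real scheduler also tracks KV-cache memory.

from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (id, tokens_to_generate). Returns per-step batch ids."""
    waiting = deque(requests)
    running = {}                        # id -> tokens still to generate
    trace = []
    while waiting or running:
        # Admit new requests into free slots (static batching would instead
        # wait for the whole batch to finish before admitting anyone).
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        trace.append(sorted(running))
        # One decode step produces one token for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:       # finished: its slot frees immediately
                del running[rid]
    return trace

trace = continuous_batching([("a", 2), ("b", 4), ("c", 1)], max_batch=2)
```

Here "c" starts as soon as "a" finishes instead of idling until "b" is done too, so the GPU stays at full batch occupancy nearly the whole time.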
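The virtual-memory analogy from Lesson 05 can be sketched as a block table: each sequence's KV cache lives in fixed-size blocks drawn from a shared pool, so memory is allocated on demand instead of reserved for the maximum length. Class and method names below are illustrative, not vLLM's actual API.

```python
# Minimal paged KV-cache bookkeeping sketch (the idea behind PagedAttention).
# Only the block-table accounting is shown; the actual K/V tensors are omitted.

BLOCK_SIZE = 4  # tokens per physical block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token of `seq_id`."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current block full: map a new one
            if not self.free:
                raise MemoryError("cache exhausted; scheduler must preempt")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):          # 6 tokens occupy ceil(6/4) = 2 physical blocks
    cache.append_token("seq-A")
```

Wasted memory is capped at one partially filled block per sequence, versus a whole max-length reservation under contiguous allocation, which is what makes much larger batches fit.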