KV Cache Explained: Why LLM Inference Slows Down

Your first few hundred tokens come back fast. By token 8,000 the same model on the same GPU is noticeably slower, and the bill is climbing in a way that doesn't feel linear with the work. Nothing about the model changed. What changed is a data structure that's been quietly growing in GPU memory the whole time: the KV cache.

The slowdown isn't the model "thinking harder" about a long prompt. It's bytes. Every token you generate forces the GPU to re-read a cache that got bigger with the previous token, and moving those bytes — not doing the math — is what caps your tokens per second.

The work you'd otherwise repeat

Transformers generate text autoregressively: one token at a time, each new token attending to every token before it. Attention needs a Key (K) and Value (V) vector for each previous position. Computed naively, every decode step would re-run the K and V projections for the entire history — token 8,000 would recompute the keys and values for tokens 1 through 7,999, throw them away, then do it again for token 8,001. That's quadratic work for something that never changes: the K and V for token 500 are identical no matter how long the sequence gets.
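To put a rough number on that repeated work, here is a minimal counting sketch. It assumes one K/V projection per context position per step and looks at a single layer; the function names are made up for illustration.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Per-position K/V projections (one layer) if every step reprojects the history."""
    # Step t has t positions in context (t - 1 earlier tokens plus the new one),
    # and the naive approach projects all of them again.
    return sum(range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """If K and V are stored, each position is projected exactly once."""
    return num_tokens


n = 8_000
naive = kv_projections_without_cache(n)
cached = kv_projections_with_cache(n)
print(f"naive:  {naive:,}")               # 32,004,000
print(f"cached: {cached:,}")              # 8,000
print(f"ratio:  {naive / cached:,.1f}x")  # 4,000.5x
```

The naive count is n(n + 1) / 2, which is where the "quadratic" comes from.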

Enter the cache

So you keep them. The KV cache stores the K and V vectors for every token already processed, at every layer, for every key/value head. A new token computes only its own K and V, appends them, and attends against everything already in the cache. Per-step attention goes from "recompute over all n tokens" to "compute one, read n." That trade is why generation is practical at all.

The catch is in that word "read." You've turned a compute problem into a memory problem, and memory is where it bites.
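The append-and-read pattern can be sketched for a single attention head in plain Python. This is a toy under stated assumptions, not how any real framework lays out its cache: the weights are random, the dimensions are tiny, and names like project and attend are hypothetical helpers. It checks that the cached path gives the same output as reprojecting the whole history at every step.

```python
import math
import random

random.seed(0)
D_MODEL, HEAD_DIM, STEPS = 8, 4, 6


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    """Vector-matrix product: x (length rows) times W (rows x cols)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over all stored positions."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


# Hypothetical random projection weights for one head.
W_q, W_k, W_v = (rand_matrix(D_MODEL, HEAD_DIM) for _ in range(3))
xs = rand_matrix(STEPS, D_MODEL)  # stand-in hidden states, one row per token

k_cache, v_cache = [], []
max_diff = 0.0

for t, x in enumerate(xs):
    q = project(x, W_q)

    # Cached path: project only the new token, append, then read everything stored.
    k_cache.append(project(x, W_k))
    v_cache.append(project(x, W_v))
    out_cached = attend(q, k_cache, v_cache)

    # Naive path: reproject K and V for the whole history at every step.
    history = xs[: t + 1]
    k_full = [project(h, W_k) for h in history]
    v_full = [project(h, W_v) for h in history]
    out_naive = attend(q, k_full, v_full)

    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(out_cached, out_naive)))

print("cached positions:", len(k_cache))
print("max output difference:", max_diff)
```

Notice that attend still loops over every stored position. The projections went away; the read did not.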

Figure: key-value cache blocks stacking up across transformer layers as tokens are generated.

It grows, term by term

Here's the whole size story in one line:

KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

Walk the terms:

2 — you store both K and V.
num_layers — every transformer layer keeps its own cache.
num_kv_heads — one entry per KV head; this is the term GQA and MLA attack (more below).
head_dim — the width of each head's vector.
seq_len — total tokens in context. This is the one that runs away.
batch_size — concurrent sequences. The cache is per-request.
bytes_per_element — 2 for FP16/BF16, 1 for FP8.

Two of these terms, seq_len and batch_size, change with the workload at inference time, and both are linear: double the context, double the cache; double the concurrent users, double the cache.
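The formula translates directly into a few lines of Python. The model shape below is hypothetical and chosen only to make the linear scaling visible; the function name kv_cache_bytes is likewise just for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes: 2 (K and V) times every other term in the formula."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model shape, not taken from any specific model card.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

base = kv_cache_bytes(**shape, seq_len=4096)
longer = kv_cache_bytes(**shape, seq_len=8192)                   # double the context
wider = kv_cache_bytes(**shape, seq_len=4096, batch_size=2)      # double the users
fp8 = kv_cache_bytes(**shape, seq_len=4096, bytes_per_element=1)  # halve the element size

MIB = 2**20
for label, size in [("base", base), ("2x context", longer),
                    ("2x batch", wider), ("FP8", fp8)]:
    print(f"{label:>10}: {size / MIB:,.0f} MiB")
```

With these assumed dimensions the base case is 512 MiB, both doublings give 1,024 MiB, and the FP8 case gives 256 MiB.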

What that costs in practice

Numbers make it real.

Model	Precision	Context	KV cache
Llama-2-7B	FP16	per token	~0.5 MB
Llama-2-7B	FP16	~28k tokens	~14 GB
Llama 3.1 70B	BF16	128K, 1 user	~42.9 GB

For Llama-2-7B, ~14 GB of cache is roughly the size of the model's own FP16 weights. Because Llama 2's context window is 4,096 tokens, that ~28k-token figure is a total across concurrent sequences, not one long prompt. For Llama 3.1 70B at full 128K context, a single sequence's cache is ~43 GB, and that's before you serve a second user. The pattern: for short prompts, weights dominate memory; for long contexts, the cache dominates, and it's what decides whether the workload fits in VRAM and how fast it runs.
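The table's figures can be reproduced from the formula. The layer counts, KV-head counts, and head dimensions below are assumptions taken from the models' published configurations (they are not stated in this article), and "128K" is read as 131,072 tokens.

```python
GB = 1e9     # decimal gigabytes
MIB = 2**20  # binary mebibytes

# Assumed shapes: (num_layers, num_kv_heads, head_dim, bytes_per_element)
LLAMA_2_7B = (32, 32, 128, 2)     # multi-head attention, FP16
LLAMA_3_1_70B = (80, 8, 128, 2)   # grouped-query attention, BF16


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Cache bytes added by one token of one sequence (K and V, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


per_tok_7b = kv_bytes_per_token(*LLAMA_2_7B)
per_tok_70b = kv_bytes_per_token(*LLAMA_3_1_70B)

print(f"Llama-2-7B, per token:         {per_tok_7b / MIB:.2f} MiB")             # 0.50
print(f"Llama-2-7B, 28,000 tokens:     {per_tok_7b * 28_000 / GB:.1f} GB")      # 14.7
print(f"Llama 3.1 70B, 131,072 tokens: {per_tok_70b * 131_072 / GB:.1f} GB")    # 42.9
```

Under these assumptions the 70B model stores fewer cache bytes per token (320 KiB) than the 7B model (512 KiB), because it keeps 8 KV heads instead of 32. The 28,000-token row comes out at 14.7 decimal GB, or about 13.7 GiB, which is consistent with the table's rounded ~14 GB.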

Why a bigger cache means slower tokens

This is the part the title promises, and it comes down to the two phases of inference behaving completely differently.

Prefill, the phase that processes your prompt, is compute-bound. The whole prompt goes through in parallel as large matrix–matrix multiplications that saturate the GPU's FLOPs. The attention part of the cost scales quadratically with prompt length (the projections and feed-forward layers scale linearly), but the hardware is busy doing math, which is what it's good at.

Decode — generating one token at a time — is memory-bandwidth-bound. Each step does very little math (matrix–vector work for a single new token) but has to stream the model weights and read the entire KV cache out of memory. The GPU isn't limited by how fast it can multiply. It's limited by how fast it can move bytes.

Put those together and the slowdown is mechanical. Every generated token re-reads the whole cache. The cache got bigger when you generated the previous token. More context → more bytes moved per step → fewer tokens per second. The model isn't working harder. It's waiting on memory.
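A back-of-the-envelope bound makes the mechanism concrete. The sketch below assumes a single stream, that each decode step streams the weights and the whole cache exactly once, and that nothing else costs time. The 2 TB/s bandwidth and 14 GB of weights are hypothetical round numbers, and the per-token cache size is the Llama-2-7B figure from the table above, reused here for a generic 7B-class model.

```python
def decode_tokens_per_second_bound(weight_bytes, kv_bytes_per_token,
                                   context_len, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed from memory traffic alone."""
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_len
    return bandwidth_bytes_per_s / bytes_per_step


WEIGHT_BYTES = 14e9           # assumption: ~7B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 524_288  # per-token FP16 cache size from the table
BANDWIDTH = 2e12              # hypothetical 2 TB/s of memory bandwidth

bounds = {
    n: decode_tokens_per_second_bound(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, n, BANDWIDTH)
    for n in (500, 8_000, 32_000)
}
for context_len, bound in bounds.items():
    print(f"context {context_len:>6,}: at most {bound:6.1f} tokens/s")
```

Real throughput will sit below this ceiling, but the shape is the point: the weight term stays fixed while the cache term grows with every token, so the bound only falls.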

Figure: side-by-side comparison of the compute-bound prefill phase and the memory-bandwidth-bound decode phase.

The second tax: concurrency

Bandwidth caps single-stream speed. Capacity caps how many streams you can run at once. The cache shares VRAM with the weights, so every gigabyte of cache is a gigabyte you can't spend on batching more users or allowing longer contexts. That's why a server comfortable with 50 short-prompt users can choke on 10 long-context ones — same model, the cache just ate the headroom.
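The capacity side is simple division. This sketch assumes a hypothetical 80 GB GPU, 14 GB of weights, and the per-token FP16 cache size from the table; it ignores activations, fragmentation, and framework overhead, so real limits will be lower.

```python
def max_concurrent_sequences(vram_bytes, weight_bytes,
                             kv_bytes_per_token, tokens_per_sequence):
    """How many sequences' caches fit in the VRAM left over after the weights."""
    free_bytes = vram_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (kv_bytes_per_token * tokens_per_sequence)


VRAM_BYTES = 80_000_000_000    # hypothetical 80 GB accelerator
WEIGHT_BYTES = 14_000_000_000  # assumption: ~7B parameters in FP16
KV_BYTES_PER_TOKEN = 524_288   # per-token cache size from the table

for tokens in (1_000, 4_000, 16_000):
    fit = max_concurrent_sequences(VRAM_BYTES, WEIGHT_BYTES, KV_BYTES_PER_TOKEN, tokens)
    print(f"{tokens:>6,} tokens per sequence -> {fit} concurrent sequences")
```

Under these assumptions, going from 1,000 to 16,000 tokens per sequence drops the ceiling from 125 sequences to 7.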

How people fight the cache

Every serious optimization targets a specific term in that formula, or the bandwidth it implies.

Technique	What it attacks	Effect
Grouped-Query Attention (GQA)	num_kv_heads	Query heads share fewer KV heads (e.g. 32 → 8) for a ~4× smaller cache with negligible quality loss. Standard across all Llama 3 sizes.
Multi-head Latent Attention (MLA)	num_kv_heads / head_dim	Stores a small latent vector per token and reconstructs K/V on the fly. ~10× smaller cache; used in DeepSeek-V3.
PagedAttention (vLLM)	wasted / fragmented memory	Manages the cache in fixed-size blocks like OS paging. Cuts waste from 60–80% down to under 4%, for 2–4× throughput at the same latency.
Prefix caching	recompute of shared prefixes	Reuses the KV blocks of an identical prompt prefix across requests. Big win for system-prompt-heavy and agentic workloads.
KV quantization / offload	bytes_per_element / location	FP8 or INT4 cache, or spilling to CPU memory — trade precision or bandwidth for capacity.

GQA and MLA shrink the cache at the architecture level, so you inherit them by picking the right model. PagedAttention and prefix caching are properties of your serving stack — vLLM and SGLang hand them to you. Quantization and offload are the knobs you turn when you're out of VRAM and willing to trade.
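To see why block-based allocation recovers so much memory, compare two reservation strategies on a toy workload. This is not vLLM's implementation, only an accounting sketch: the sequence lengths, the 2,048-slot preallocation, and the 16-token block size are all hypothetical, and the resulting percentages describe this made-up workload, not measured results.

```python
import math


def contiguous_reserved(seq_lens, max_seq_len):
    """Token slots reserved when every request preallocates max_seq_len slots."""
    return len(seq_lens) * max_seq_len


def paged_reserved(seq_lens, block_size):
    """Token slots reserved when each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved, used):
    """Share of reserved slots that hold no token."""
    return 1 - used / reserved


seq_lens = [130, 512, 75, 1900, 300, 48]  # hypothetical current sequence lengths
used = sum(seq_lens)

contig = contiguous_reserved(seq_lens, max_seq_len=2048)
paged = paged_reserved(seq_lens, block_size=16)

print(f"contiguous: {contig:,} slots reserved, {waste_fraction(contig, used):.1%} wasted")
print(f"paged:      {paged:,} slots reserved, {waste_fraction(paged, used):.1%} wasted")
```

With paging, the only waste is the unfilled tail of each sequence's last block, so it is bounded by one block per sequence no matter how long the maximum context is.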

The mental model to keep

Context length is a memory-bandwidth tax, and you pay it on every single generated token. That reframes a few decisions. Trimming dead context isn't only about token cost — it directly buys decode speed. Reusing a fixed system-prompt prefix is worth engineering around. And when you pick a serving stack, you're really choosing how well it manages this one cache. The model's quality lives in the weights; your latency lives in the cache.

References

Pierre Lienhart — LLM Inference Series: 3. KV caching explained
Spheron — KV Cache Optimization Guide
Towards Data Science — Prefill Is Compute-Bound. Decode Is Memory-Bound.
Kwon et al. — Efficient Memory Management for LLM Serving with PagedAttention (vLLM), arXiv:2309.06180
DeepSeek-AI — DeepSeek-V3 Technical Report, arXiv:2412.19437
IBM — What is grouped-query attention (GQA)?
