KV Cache Explained: Why LLM Inference Slows Down

Sanchez Kim
AI Engineer · June 22, 2026 · 6 min read

LLM generation slows down on long contexts because of one data structure: the KV cache. It grows linearly with every token and must be re-read in full on each decode step, making decode memory-bandwidth bound. Here is the formula, the real numbers, and how GQA, MLA, PagedAttention, and prefix caching fight back.


Your first few hundred tokens come back fast. By token 8,000 the same model on the same GPU is noticeably slower, and the bill is climbing in a way that doesn't feel linear with the work. Nothing about the model changed. What changed is a data structure that's been quietly growing in GPU memory the whole time: the KV cache.

The slowdown isn't the model "thinking harder" about a long prompt. It's bytes. Every token you generate forces the GPU to re-read a cache that got bigger with the previous token, and moving those bytes — not doing the math — is what caps your tokens per second.

The work you'd otherwise repeat

Transformers generate text autoregressively: one token at a time, each new token attending to every token before it. Attention needs a Key (K) and Value (V) vector for each previous position. Computed naively, every decode step would re-run the K and V projections for the entire history: token 8,000 would recompute the keys and values for tokens 1 through 7,999, throw them away, then do it all again for token 8,001. Each step is linear in the history, so the total over a generation is quadratic, and all of it is spent on something that never changes: the K and V for token 500 are identical no matter how long the sequence gets.

Enter the cache

So you keep them. The KV cache stores the K and V vectors for every token already processed, at every layer, for every attention head. A new token computes only its own K and V, appends them, and attends against everything already in the cache. Per-step attention goes from "recompute over all n tokens" to "compute one, read n." That trade is why generation is practical at all.

The catch is in that word read. You've turned a compute problem into a memory problem, and memory is where it bites.
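A minimal sketch of that trade, in plain Python with a single attention head and random toy weights. The function names, dimensions, and weights are illustrative assumptions, not any framework's API. It decodes the same sequence both ways and counts how many K/V projections each approach performs:

```python
import math
import random

def random_matrix(rng, rows, cols):
    """Toy weights/embeddings: rows x cols values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Single-head softmax attention for one query over all stored positions."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_naive(xs, Wq, Wk, Wv):
    """Re-project K and V for the whole history at every step."""
    outs, projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(x, Wk) for x in xs[:t]]
        values = [matvec(x, Wv) for x in xs[:t]]
        projections += t                      # t K/V projections this step
        outs.append(attend(matvec(xs[t - 1], Wq), keys, values))
    return outs, projections

def decode_cached(xs, Wq, Wk, Wv):
    """Project K and V once per token, append to a cache, and reuse."""
    k_cache, v_cache = [], []
    outs, projections = [], 0
    for x in xs:
        k_cache.append(matvec(x, Wk))
        v_cache.append(matvec(x, Wv))
        projections += 1                      # one K/V projection this step
        outs.append(attend(matvec(x, Wq), k_cache, v_cache))
    return outs, projections

rng = random.Random(0)
d_model = 4
Wq, Wk, Wv = (random_matrix(rng, d_model, d_model) for _ in range(3))
xs = random_matrix(rng, 6, d_model)           # six toy token embeddings

naive_out, naive_proj = decode_naive(xs, Wq, Wk, Wv)
cached_out, cached_proj = decode_cached(xs, Wq, Wk, Wv)
print(naive_proj, cached_proj)                # 21 6
```

For six tokens the naive loop performs 21 K/V projections and the cached loop performs 6, while the attention outputs match. What the cached version still does on every step is walk the full cache inside `attend`, and that is the "read" in question.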

[Figure: key-value cache blocks stacking up across transformer layers as tokens are generated]

It grows, term by term

Here's the whole size story in one line:

KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

Walk the terms:

  • 2 — you store both K and V.
  • num_layers — every transformer layer keeps its own cache.
  • num_kv_heads — one entry per KV head; this is the term GQA and MLA attack (more below).
  • head_dim — the width of each head's vector.
  • seq_len — total tokens in context. This is the one that runs away.
  • batch_size — concurrent sequences. The cache is per-request.
  • bytes_per_element — 2 for FP16/BF16, 1 for FP8.

Two of these, seq_len and batch_size, vary at inference time, and both are linear: double the context, double the cache; double the concurrent users, double the cache.
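The formula translates directly into a small helper. This is a sketch: the function name is hypothetical, and the Llama-2-7B shape used in the example (32 layers, 32 KV heads, head_dim 128) is an assumption taken from the model's published configuration rather than from the text above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 accounts for storing both K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Assumed Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(per_token)                              # 524288 bytes, i.e. 0.5 MiB

# Linearity in the two runtime terms: doubling either one doubles the cache.
base = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=4)
assert kv_cache_bytes(32, 32, 128, seq_len=8192, batch_size=4) == 2 * base
assert kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8) == 2 * base
```

Dropping `bytes_per_element` to 1 models an FP8 cache, and lowering `num_kv_heads` models a GQA variant of the same architecture.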

What that costs in practice

Numbers make it real.

| Model | Precision | Context | KV cache |
|---|---|---|---|
| Llama-2-7B | FP16 | per token | ~0.5 MB |
| Llama-2-7B | FP16 | ~28k tokens | ~14 GB |
| Llama 3.1 70B | BF16 | 128K, 1 user | ~42.9 GB |

For Llama-2-7B, ~14 GB of cache is roughly the size of the model's own FP16 weights. For Llama 3.1 70B at full 128K context, a single sequence's cache is ~43 GB — and that's before you serve a second user. The pattern: for short prompts, weights dominate memory; for long contexts, the cache dominates, and it's what decides whether the workload fits in VRAM and how fast it runs.
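For readers who want to check those figures against the formula, the arithmetic works out as follows. The model shapes are assumptions taken from the published configurations (Llama-2-7B: 32 layers, 32 KV heads, head_dim 128; Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128), with batch size 1 and 2 bytes per element. "~28k tokens" is taken as 28,672 and "128K" as 131,072.

```latex
\begin{align*}
\text{Llama-2-7B, per token:}\quad
  & 2 \times 32 \times 32 \times 128 \times 2\ \text{B}
    = 524{,}288\ \text{B} = 0.5\ \text{MiB} \\
\text{Llama-2-7B, } 28{,}672 \text{ tokens:}\quad
  & 524{,}288\ \text{B} \times 28{,}672
    = 15{,}032{,}385{,}536\ \text{B} = 14\ \text{GiB} \\
\text{Llama 3.1 70B, per token:}\quad
  & 2 \times 80 \times 8 \times 128 \times 2\ \text{B}
    = 327{,}680\ \text{B} = 320\ \text{KiB} \\
\text{Llama 3.1 70B, } 131{,}072 \text{ tokens:}\quad
  & 327{,}680\ \text{B} \times 131{,}072
    = 42{,}949{,}672{,}960\ \text{B} \approx 42.9\ \text{GB}
\end{align*}
```

Note that the 70B model stores fewer bytes per token than the 7B one despite having 2.5 times as many layers, because it keeps 8 KV heads per layer instead of 32.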

Why a bigger cache means slower tokens

This is the part the title promises, and it comes down to the two phases of inference behaving completely differently.

Prefill, the pass that processes your prompt, is compute-bound. The whole prompt goes through in parallel as large matrix–matrix multiplications that saturate the GPU's FLOPs. The attention part of the cost scales quadratically with prompt length while the projections and MLP layers scale linearly, but either way the hardware is busy doing math, which is what it's good at.

Decode — generating one token at a time — is memory-bandwidth-bound. Each step does very little math (matrix–vector work for a single new token) but has to stream the model weights and read the entire KV cache out of memory. The GPU isn't limited by how fast it can multiply. It's limited by how fast it can move bytes.

Put those together and the slowdown is mechanical. Every generated token re-reads the whole cache. The cache got bigger when you generated the previous token. More context → more bytes moved per step → fewer tokens per second. The model isn't working harder. It's waiting on memory.
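A back-of-the-envelope way to see this is to divide memory bandwidth by the bytes each decode step has to move. The sketch below is a ceiling, not a benchmark: it ignores compute, kernel overheads, and batching, the function name is hypothetical, and the 2 TB/s bandwidth figure is an assumed round number rather than a measurement of any particular GPU. The weight and per-token cache sizes are the Llama-2-7B figures quoted earlier.

```python
def decode_tokens_per_second_ceiling(weight_bytes, kv_bytes_per_token,
                                     context_tokens, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed if moving bytes were the only cost."""
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_tokens
    return bandwidth_bytes_per_s / bytes_per_step

# Illustrative inputs (assumptions):
WEIGHT_BYTES = 14 * 10**9        # ~14 GB of FP16 weights
KV_BYTES_PER_TOKEN = 524_288     # ~0.5 MB of cache per token
BANDWIDTH = 2 * 10**12           # hypothetical 2 TB/s of memory bandwidth

for context in (512, 8_000, 28_000):
    ceiling = decode_tokens_per_second_ceiling(
        WEIGHT_BYTES, KV_BYTES_PER_TOKEN, context, BANDWIDTH)
    print(f"{context:>6} tokens of context: at most ~{ceiling:.0f} tokens/s")
```

At around 28k tokens the cache is about as large as the weights, so each step moves roughly twice the bytes and the ceiling is roughly half of what it was on a short prompt.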

[Figure: side-by-side comparison of compute-bound prefill and memory-bandwidth-bound decode phases]

The second tax: concurrency

Bandwidth caps single-stream speed. Capacity caps how many streams you can run at once. The cache shares VRAM with the weights, so every gigabyte of cache is a gigabyte you can't spend on batching more users or allowing longer contexts. That's why a server comfortable with 50 short-prompt users can choke on 10 long-context ones — same model, the cache just ate the headroom.
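The capacity side is simple division. In the sketch below, the function name, the 80 GB of VRAM, and the two context lengths are assumptions chosen for illustration; the weight and per-token cache sizes are the Llama-2-7B figures from earlier. Real servers also reserve memory for activations and fragmentation, which the optional `reserve_bytes` only gestures at.

```python
def max_concurrent_sequences(vram_bytes, weight_bytes, kv_bytes_per_token,
                             tokens_per_sequence, reserve_bytes=0):
    """How many sequences' KV caches fit in the VRAM left over after the weights."""
    free = vram_bytes - weight_bytes - reserve_bytes
    if free <= 0:
        return 0
    return free // (kv_bytes_per_token * tokens_per_sequence)

VRAM = 80 * 10**9                # hypothetical 80 GB accelerator
WEIGHTS = 14 * 10**9             # ~14 GB of FP16 weights
KV_PER_TOKEN = 524_288           # ~0.5 MB of cache per token

short_users = max_concurrent_sequences(VRAM, WEIGHTS, KV_PER_TOKEN, 2_000)
long_users = max_concurrent_sequences(VRAM, WEIGHTS, KV_PER_TOKEN, 16_000)
print(short_users, long_users)   # 62 7
```

Under these assumptions, the same card that holds 62 sequences of 2,000 tokens holds only 7 sequences of 16,000 tokens.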

How people fight the cache

Every serious optimization targets a specific term in that formula, or the bandwidth it implies.

| Technique | What it attacks | Effect |
|---|---|---|
| Grouped-Query Attention (GQA) | num_kv_heads | Query heads share fewer KV heads (e.g. 32 → 8) for a ~4× smaller cache with negligible quality loss. Standard across all Llama 3 sizes. |
| Multi-head Latent Attention (MLA) | num_kv_heads / head_dim | Stores a small latent vector per token and reconstructs K/V on the fly. ~10× smaller cache; used in DeepSeek-V3. |
| PagedAttention (vLLM) | wasted / fragmented memory | Manages the cache in fixed-size blocks like OS paging. Cuts waste from 60–80% down to under 4%, for 2–4× throughput at the same latency. |
| Prefix caching | recompute of shared prefixes | Reuses the KV blocks of an identical prompt prefix across requests. Big win for system-prompt-heavy and agentic workloads. |
| KV quantization / offload | bytes_per_element / location | FP8 or INT4 cache, or spilling to CPU memory: trade precision or bandwidth for capacity. |

GQA and MLA shrink the cache at the architecture level, so you inherit them by picking the right model. PagedAttention and prefix caching are properties of your serving stack — vLLM and SGLang hand them to you. Quantization and offload are the knobs you turn when you're out of VRAM and willing to trade.
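To make the two serving-stack ideas concrete, here is a toy bookkeeping sketch of block-based allocation with prefix reuse. It is not how vLLM or SGLang implement it: `ToyBlockPool` and its methods are hypothetical, it tracks only block ids (no tensors, no eviction, no reference counts), and it identifies a full block by its parent block plus its own tokens so that two requests share a block only when their entire prefix matches.

```python
class ToyBlockPool:
    """Toy KV bookkeeping: fixed-size blocks, shared when the whole prefix matches."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.shared = {}         # (parent block id, tokens in block) -> block id
        self.next_id = 0

    def _new_block(self):
        block_id = self.next_id
        self.next_id += 1
        return block_id

    def allocate(self, token_ids):
        """Return (block_table, reused_blocks) for one request's prompt."""
        block_table, reused, parent = [], 0, None
        n_full = len(token_ids) // self.block_size
        for i in range(n_full):
            start = i * self.block_size
            chunk = tuple(token_ids[start:start + self.block_size])
            key = (parent, chunk)
            if key in self.shared:
                reused += 1                      # KV for this block already exists
            else:
                self.shared[key] = self._new_block()
            parent = self.shared[key]
            block_table.append(parent)
        if len(token_ids) % self.block_size:
            block_table.append(self._new_block())  # partial tail block stays private
        return block_table, reused

pool = ToyBlockPool(block_size=4)
system_prompt = list(range(8))                   # 8 shared tokens = 2 full blocks
table_a, reused_a = pool.allocate(system_prompt + [100, 101, 102])
table_b, reused_b = pool.allocate(system_prompt + [200])
print(table_a, reused_a)                         # [0, 1, 2] 0
print(table_b, reused_b)                         # [0, 1, 3] 2
```

The second request reuses both blocks of the shared system prompt and allocates only one new block for its own tail. Because memory is handed out one block at a time, a sequence wastes at most `block_size - 1` token slots, instead of a whole reservation sized for the maximum context length.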

The mental model to keep

Context length is a memory-bandwidth tax, and you pay it on every single generated token. That reframes a few decisions. Trimming dead context isn't only about token cost — it directly buys decode speed. Reusing a fixed system-prompt prefix is worth engineering around. And when you pick a serving stack, you're really choosing how well it manages this one cache. The model's quality lives in the weights; your latency lives in the cache.

References

  • Pierre Lienhart — LLM Inference Series: 3. KV caching explained
  • Spheron — KV Cache Optimization Guide
  • Towards Data Science — Prefill Is Compute-Bound. Decode Is Memory-Bound.
  • Kwon et al. — Efficient Memory Management for LLM Serving with PagedAttention (vLLM), arXiv:2309.06180
  • DeepSeek-AI — DeepSeek-V3 Technical Report, arXiv:2412.19437
  • IBM — What is grouped-query attention (GQA)?
