Automatone
HomeBlogAbout

Automatone

AI tools, dev workflows, and automation. No hype, just what works.

Pages

HomeBlogAbout

Connect

GitHubRSS Feed

© 2026 Automatone. Built with Next.js.

Admin
  1. Home
  2. ›Troubleshooting
  3. ›Fix "CUDA out of memory" Errors in PyTorch

Fix "CUDA out of memory" Errors in PyTorch

Sanchez Kim
Sanchez Kim
AI Engineer · June 17, 2026 · 8 min read

CUDA out of memory in PyTorch often fires while nvidia-smi still shows free VRAM, because the caching allocator needs a contiguous block, not just free space. This guide gives a copy-paste triage path from fastest fixes to fragmentation tuning, plus how to tell a real leak apart from fragmentation.

#PyTorch#CUDA#GPU memory#out of memory#deep learning#machine learning#troubleshooting#mixed precision#model training
Fix "CUDA out of memory" Errors in PyTorch

torch.cuda.OutOfMemoryError: CUDA out of memory is one of those errors that lies to you. You'll see it fire while nvidia-smi still reports a few gigabytes free, and your first instinct — "I just need a bigger GPU" — is often wrong. PyTorch doesn't hand memory straight back to the driver. It runs a caching allocator, so the number that actually decides whether your allocation succeeds isn't total free VRAM. It's whether the allocator can find a single contiguous block big enough.

That distinction splits every OOM into one of two problems:

  1. You genuinely need more memory than you have. Model weights + activations + optimizer state + batch exceed VRAM. No trick changes physics here; you have to use less.
  2. Fragmentation. Enough free memory exists, but it's scattered across non-contiguous blocks, so a large request fails anyway. The tell is in the error itself: reserved is much larger than allocated.

The fix path is different for each, so read the error before you start randomly lowering batch sizes.

Read the error message first

The standard message looks like this:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 24.00 GiB total capacity; 19.40 GiB already allocated;
512.00 MiB free; 22.10 GiB reserved in total by PyTorch)

Two numbers matter. allocated is memory held by live tensors. reserved is what PyTorch grabbed from the driver and is now caching for reuse. You can print them yourself:

import torch
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(torch.cuda.memory_summary())   # full breakdown

The decision rule is simple. If reserved sits far above allocated — say 22 GiB reserved against 12 GiB allocated — the memory is there but fragmented; jump to the fragmentation section. If reserved and allocated are close and both near your card's capacity, you genuinely need to use less. Start with the fast fixes.

A flowchart distinguishing a fragmentation problem from a genuine out-of-memory problem based on reserved versus allocated memory

Fastest fixes, in order

Try these top to bottom. The first one solves most cases.

Lower the batch size. It's the single highest-leverage knob, and activation memory scales roughly linearly with it. Halve it and re-run before doing anything cleverer.

Wrap inference and evaluation in torch.inference_mode(). If you're computing predictions, validating, or running generation, you don't need the autograd graph — and keeping it pins every intermediate activation in memory. inference_mode() is the modern replacement for no_grad(): it's stricter and faster, so prefer it unless you later need those tensors in an autograd computation.

model.eval()
with torch.inference_mode():
    preds = model(batch)

Note that model.eval() alone does not disable gradient tracking — it only flips layers like dropout and batchnorm. You still need the context manager.

Don't accumulate tensors that carry graph history. This is the classic OOM that grows over time. Summing the raw loss tensor into a running total keeps the entire computation graph alive across iterations:

running_loss += loss          # wrong: holds the graph, leaks every step
running_loss += loss.item()   # right: pulls out a plain Python float

Same rule for any list you append GPU tensors to for later logging — call .detach() (or .item() / .cpu()) first.

del big intermediates and clear the cache between phases. Going from training to evaluation in the same process is a common spot to free things up:

del optimizer, train_outputs
torch.cuda.empty_cache()

One caveat people miss: empty_cache() only returns cached (reserved-but-unused) blocks to the driver. It does nothing for live tensors, and calling it inside your training loop just forces PyTorch to re-request memory it was about to reuse, which costs throughput. Use it at phase boundaries, not every step.

Fit a bigger model without buying a GPU

When you've trimmed the obvious waste and still don't fit, reach for techniques that change the memory/compute trade-off. They compose — you can stack mixed precision, accumulation, and checkpointing together.

Technique What it trades When to reach for it
Lower batch size Throughput, noisier gradients First thing, always
Mixed precision (AMP) Small fp16 numerical risk Almost always worth it on modern GPUs
Gradient accumulation Wall-clock time Need a large effective batch
Gradient checkpointing ~1 extra forward pass Deep, activation-bound models
8-bit optimizer / offload / ZeRO Complexity, some speed Optimizer state or weights don't fit

Gradient accumulation simulates a large batch by summing gradients over several micro-batches before stepping:

accum_steps = 4
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Automatic Mixed Precision (AMP) runs most ops in fp16 or bf16, which cuts activation memory and speeds up Tensor Core math. Exact savings depend on the model, but it's usually the cheapest large win after batch size:

scaler = torch.amp.GradScaler("cuda")
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

With bfloat16 you can often skip the GradScaler entirely; it mainly exists to keep float16 gradients from underflowing.

Gradient (activation) checkpointing stores only a subset of activations during the forward pass and recomputes the rest during backward. You pay roughly one extra forward pass in exchange for a big drop in activation memory — worth it for deep transformers:

from torch.utils.checkpoint import checkpoint
out = checkpoint(my_block, x, use_reentrant=False)

If the optimizer state or the weights themselves are what blow up, look at 8-bit optimizers (bitsandbytes), CPU offload, or sharding the state across GPUs with FSDP or DeepSpeed ZeRO. Those are their own rabbit holes, but they're the right tools when a single device can't hold the full state.

A conceptual diagram of gradient checkpointing trading recomputation for lower activation memory

Fragmentation: free memory that you still can't use

If your error shows reserved towering over allocated, you're fighting the allocator, not your model size. Each cudaMalloc returns an independent segment that can never merge with another, so a workload that allocates and frees blocks of varying sizes ends up with free memory chopped into pieces too small to satisfy a big request.

The modern fix is expandable_segments. Instead of grabbing fixed segments, the allocator reserves a virtual address range and maps physical memory as the segment grows, which lets it expand rather than fragment. Set it before your process starts:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py

It's still opt-in in current PyTorch, not the default. One reported torchtune case (A100 80GB) saw peak usage drop from 16.39 GiB to 10.83 GiB — about a third — just by flipping this on, with no speed penalty. Your mileage will vary, but it's a one-line thing to try and it costs nothing to test.

PYTORCH_CUDA_ALLOC_CONF key Purpose Status
expandable_segments:True Segments grow instead of fragmenting Recommended, opt-in
max_split_size_mb:N Cap block splitting to limit fragmentation Older knob, largely superseded
garbage_collection_threshold:0.8 Reclaim cached blocks proactively above a fill ratio Situational

max_split_size_mb predates expandable_segments and was the old way to tame fragmentation by stopping the allocator from carving large blocks into unreclaimable small ones. Try expandable_segments first; reach for max_split_size_mb only if the newer option doesn't help or isn't available in your build.

Find the real leak

If memory climbs steadily across epochs until it OOMs, that's not fragmentation — it's a leak, and clearing the cache won't save you. Log allocated memory per step and watch the trend:

if step % 50 == 0:
    print(f"step {step}: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

A monotonic climb points at references you're not releasing: loss tensors retained with their graph, lists of GPU tensors that keep growing, forward hooks holding onto outputs, or anything stored without .detach().

To see exactly what's alive and where it was allocated, record a memory history and inspect the snapshot:

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run the workload that leaks ...
torch.cuda.memory._dump_snapshot("snapshot.pickle")

Drag the resulting .pickle into the visualizer at pytorch.org/memory_viz — it renders allocations over time, right in the browser, so you can trace a growing band back to the line that created it.

TL;DR checklist

When the error hits, work down this list:

  1. Read reserved vs allocated — fragmentation or genuinely full?
  2. Lower the batch size.
  3. Wrap inference and eval in torch.inference_mode().
  4. Stop accumulating graph-carrying tensors (.item() / .detach() your logs).
  5. Turn on AMP (torch.autocast).
  6. Add gradient accumulation for a larger effective batch.
  7. Add gradient checkpointing for deep models.
  8. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for fragmentation.
  9. Still climbing over epochs? Profile with the memory snapshot tool and hunt the leak.

References

  • PyTorch — CUDA semantics & memory management
  • Understanding CUDA Memory Usage (snapshot & visualizer)
  • PyTorch blog — Understanding GPU Memory: Visualizing Allocations Over Time
  • Hugging Face Transformers — single-GPU training performance
  • torchtune issue #1185 — expandable_segments VRAM reduction

On this page

  • Read the error message first
  • Fastest fixes, in order
  • Fit a bigger model without buying a GPU
  • Fragmentation: free memory that you still can't use
  • Find the real leak
  • TL;DR checklist
  • References