Automatone

How LoRA Fine-Tuning Works (and When to Use It)

Sanchez Kim
AI Engineer · June 19, 2026 · 8 min read

LoRA freezes a model's pretrained weights and trains a tiny low-rank update instead, cutting trainable parameters by ~10,000× and GPU memory by ~3× at GPT-3 175B scale. This explainer covers how the mechanism works, what QLoRA and the other variants add, and an honest, research-backed framework for when LoRA wins and when full fine-tuning still beats it.

Full fine-tuning of a modern LLM means updating every weight in the model. For a 7-billion-parameter model trained with Adam, that means tens of gigabytes of optimizer state on top of the weights and gradients, a fresh multi-gigabyte checkpoint for every task you train, and enough GPU memory that you're renting datacenter hardware. LoRA breaks that cost structure. Instead of updating all the weights, you freeze them and train a tiny side module: on the order of 10,000× fewer trainable parameters and roughly 3× less GPU memory at GPT-3 175B scale, with quality on par with or better than full fine-tuning on the benchmarks the paper reports (Hu et al., 2021). That shift in economics is why LoRA is now the default parameter-efficient fine-tuning (PEFT) method.

This isn't a "clone this notebook" walkthrough. The goal is to make the mechanism legible, then give you an honest framework for when LoRA is the right call — including the 2024–2025 research showing it isn't a free lunch in every regime.

The core idea

A linear layer holds a weight matrix W₀. Fine-tuning learns an update ΔW and uses W₀ + ΔW. LoRA's bet is that for a given downstream task, ΔW has low intrinsic rank — the meaningful change lives in a small subspace, even though W₀ itself is full-rank. So instead of learning the full ΔW (which is the same size as W₀), you factor it into two skinny matrices:

ΔW = B · A      where W₀ and ΔW are d×k, A is r×k, B is d×r, and r ≪ min(d, k)

You freeze W₀ and train only A and B. The forward pass becomes:

h = W₀x + (α/r) · B·A·x

If W₀ is 4096×4096 (~16.8M parameters) and you pick rank r = 16, then A and B together hold about 131K parameters — under 1% of the original layer. A starts from a random Gaussian, B starts at zero, so ΔW is zero at step one and training begins exactly where the pretrained model left off.
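The arithmetic above is easy to reproduce. The following is a minimal sketch; the helper name `lora_param_counts` is illustrative and not part of any library.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k linear layer."""
    full = d * k          # every entry of W0 (or of a dense update of the same shape)
    lora = r * k + d * r  # A is r x k, B is d x r
    return full, lora


full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  lora: 131,072  ratio: 0.78%
```

Because the LoRA count grows linearly in r, doubling the rank doubles the adapter size while the frozen layer stays untouched.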

One deployment win falls out of the math: at inference you can fold the scaled update back into the base weights, W = W₀ + (α/r)·B·A. The merged matrix has the same shape as the original, so there's no added latency, unlike adapter-layer methods that bolt extra modules into the forward pass.
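The merge can be checked numerically. Here is a small pure-Python sketch with toy dimensions and made-up values for α and r (real layers would use a tensor library); it compares the adapter path against a single merged matrix that includes the α/r scaling.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4.0  # toy sizes, chosen only for illustration
scale = alpha / r

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # r x k
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # d x r, nonzero as if trained
x = [random.gauss(0, 1) for _ in range(k)]

# Adapter path: h = W0 x + (alpha / r) * B (A x)
h_adapter = [w + scale * u for w, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged path: fold the scaled update into one d x k matrix, then a single matvec
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
h_merged = matvec(W_merged, x)

# The two paths agree up to floating-point rounding
max_diff = max(abs(a - b) for a, b in zip(h_adapter, h_merged))
print(max_diff < 1e-9)
```

The merged matrix has exactly the shape of W₀, which is why serving a merged model costs the same as serving the base model.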

[Figure: the frozen weight matrix W₀ in parallel with a low-rank B·A update path, summed into the output]

The knobs that actually matter

Four settings carry most of the weight:

  • Rank r — the size of the bottleneck. Bigger r means more capacity and more parameters. Common values run 8–64; large supervised datasets can need 256 or more.
  • Alpha (α) — a scaling factor. The update is scaled by α/r, so α controls how strongly the adapter pushes on the base model. A frequent default is α = 2r.
  • Target modules — which layers get adapters. The early instinct was "just the attention query/value projections." Don't. Recent analysis is blunt: "attention-only LoRA significantly underperforms MLP-only LoRA" (Thinking Machines, 2025). Apply LoRA to all linear layers, MLP included.
  • Dropout — light regularization on the adapter path, useful on smaller datasets.

QLoRA: fine-tuning a 65B model on one GPU

LoRA shrinks the trainable footprint, but you still hold the full base model in memory to run the forward pass. QLoRA attacks that too: quantize the frozen base to 4-bit and backpropagate through it into the LoRA adapters (Dettmers et al., 2023). That's enough savings to fine-tune a 65B model on a single 48 GB GPU while preserving 16-bit fine-tuning task performance.

Three pieces make it work:

  1. 4-bit NormalFloat (NF4) — a data type that's information-theoretically optimal for the normally-distributed weights you find in a trained network.
  2. Double quantization — quantize the quantization constants themselves, squeezing out more memory.
  3. Paged optimizers — use NVIDIA unified memory to absorb the memory spikes that would otherwise OOM the run.
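A back-of-the-envelope calculation shows why these pieces matter for a 65B model. The sketch below counts weight storage only (no activations, adapters, or optimizer state). The block sizes (64 weights per first-level constant, 256 constants per second-level constant) follow the QLoRA paper's description and should be treated as assumptions if your quantization library is configured differently.

```python
def gigabytes(n_params: float, bits_per_param: float) -> float:
    """Storage in GB (10^9 bytes) for n_params values at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9


N = 65e9  # parameters in the frozen base model

fp16_gb = gigabytes(N, 16)  # 16-bit base weights
nf4_gb = gigabytes(N, 4)    # 4-bit base weights, before quantization constants

# Block-wise quantization stores one scaling constant per block of weights.
block = 64
const_bits = 32 / block  # one fp32 constant per 64 weights: 0.5 bits per parameter

# Double quantization: 8-bit constants, plus one fp32 constant per 256 of them.
dq_const_bits = 8 / block + 32 / (block * 256)  # about 0.127 bits per parameter

print(f"16-bit weights:             {fp16_gb:6.1f} GB")
print(f"4-bit weights:              {nf4_gb:6.1f} GB")
print(f"+ fp32 block constants:     {nf4_gb + gigabytes(N, const_bits):6.1f} GB")
print(f"+ double-quantized instead: {nf4_gb + gigabytes(N, dq_const_bits):6.1f} GB")
```

Under these assumptions the 4-bit base fits comfortably under 48 GB, while the 16-bit base alone would not fit on any single GPU of that size.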

The headline result: the Guanaco models hit 99.3% of ChatGPT's level on the Vicuna benchmark after about 24 hours of fine-tuning on a single GPU.

The variant landscape

The names multiply fast. Here's the short version so you can navigate them.

| Method | One-line idea | Use when |
|---|---|---|
| LoRA | Low-rank update B·A on frozen weights | The default starting point |
| QLoRA | LoRA on a 4-bit quantized base | You're memory-bound (consumer GPU, large model) |
| DoRA | Splits the update into magnitude + direction | Low ranks, where plain LoRA underperforms (non-quantized linear layers only) |
| LoRA+ | Different learning rates for A vs B | Squeezing more efficiency out of large-model training |
| rsLoRA | Rank-stabilized scaling (α/√r instead of α/r) | Training at higher ranks without instability |
Hugging Face PEFT is the standard library that implements all of these and keeps adding more — recent release cycles bundled in roughly nine new methods at once. For most people, LoRA and QLoRA cover the job; reach for the variants when you hit their specific failure mode.
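To make the rsLoRA row concrete, here is a small sketch comparing the two scaling rules at a fixed α. The function names and the value α = 16 are illustrative choices, not library defaults.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA multiplier on the B·A update."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized multiplier: divide by sqrt(r) instead of r."""
    return alpha / math.sqrt(r)


alpha = 16.0
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  alpha/r={lora_scale(alpha, r):.4f}  "
          f"alpha/sqrt(r)={rslora_scale(alpha, r):.4f}")
```

With α held fixed, going from rank 16 to rank 256 shrinks the standard multiplier by 16× but the rank-stabilized one by only 4×, so the adapter's contribution is damped far less as rank grows.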

[Figure: decision flowchart comparing LoRA and full fine-tuning across dataset size and hardware budget]

When to use it (and when not to)

The honest version, backed by 2025 research rather than vibes.

LoRA is the right call for most post-training. For instruction tuning, domain adaptation, reasoning fine-tunes, and RL on modest datasets, LoRA matches full fine-tuning, as long as the dataset doesn't exceed the adapters' storage capacity (Thinking Machines, 2025). The mental model: LoRA holds parity while the adapter has enough trainable capacity for the information in your data. Past that point, which in practice means approaching pretraining-scale data, it falls behind gradually rather than hitting a hard wall.

A few specifics worth internalizing:

  • Learning rate runs ~10× higher than full fine-tuning (closer to 15× for very short runs). This is the single most common mistake — people reuse their full-FT learning rate and conclude LoRA "doesn't work."
  • Rank scales with task. Supervised instruction tuning on large datasets may want rank 256+. RL and policy-gradient methods are so sample-light that LoRA can match full FT even at rank 1.
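The learning-rate rule of thumb above can be written down as a tiny helper. This is only a starting point for a sweep under the multipliers quoted in the bullets; the function name and the `very_short_run` flag are hypothetical.

```python
def suggested_lora_lr(full_ft_lr: float, very_short_run: bool = False) -> float:
    """Scale a known-good full fine-tuning learning rate into a LoRA starting point.

    Uses the ~10x multiplier from the text, or ~15x for very short runs.
    Treat the result as the center of a sweep, not a final answer.
    """
    multiplier = 15.0 if very_short_run else 10.0
    return full_ft_lr * multiplier


full_ft_lr = 1e-5  # example value; substitute the rate you would use for full fine-tuning
print(f"LoRA start:           {suggested_lora_lr(full_ft_lr):.1e}")
print(f"LoRA start (short):   {suggested_lora_lr(full_ft_lr, very_short_run=True):.1e}")
```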

Where full fine-tuning still wins: when the dataset is genuinely huge — beyond what the low-rank update can absorb — and when you care about the model staying a faithful model of its pretraining distribution. That last point has a name now. LoRA vs Full Fine-tuning: An Illusion of Equivalence (Shuttleworth et al., 2024) shows that even when target-task accuracy is identical, LoRA introduces new high-ranking singular vectors — "intruder dimensions" — that full fine-tuning doesn't. The practical fallout: LoRA-tuned models can become worse models of the original distribution and adapt less robustly across a sequence of tasks. Higher ranks shrink the effect. So if you're chaining several fine-tunes on top of each other, weigh that.

The serving superpower

Here's the part that changes deployment, not just training. Because every adapter shares the same frozen base, an extra adapter costs only tens of megabytes of GPU memory (roughly 50 MB is a representative figure; the exact size depends on rank, target modules, and precision). You can keep dozens of them resident on one GPU and hot-swap per request. Serving engines like vLLM hold multiple LoRA adapters in memory and select one per incoming request with negligible switch time.

That turns the multi-tenant math around. One base model, many small per-customer or per-task adapters, swapped on the fly — instead of one full fine-tuned model per tenant eating its own GPU.
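To see where an adapter's footprint comes from, you can count its parameters directly. The sketch below assumes a roughly Llama-2-7B-shaped model (hidden size 4096, MLP size 11008, 32 layers, seven adapted linear projections per layer) and 16-bit adapter weights; all of these are illustrative assumptions, and the helper name is hypothetical.

```python
def adapter_params(shapes, r: int, n_layers: int) -> int:
    """Total LoRA parameters when every (d_in, d_out) matrix gets a rank-r adapter."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers


# Assumed architecture: 4 attention projections plus 3 MLP projections per layer
hidden, mlp, n_layers = 4096, 11008, 32
shapes = [(hidden, hidden)] * 4 + [(hidden, mlp)] * 2 + [(mlp, hidden)]

for r in (8, 16, 64):
    n = adapter_params(shapes, r, n_layers)
    print(f"r={r:>2}: {n:,} params, {n * 2 / 1e6:.0f} MB in 16-bit")
# r= 8: 19,988,480 params, 40 MB in 16-bit
# r=16: 39,976,960 params, 80 MB in 16-bit
# r=64: 159,907,840 params, 320 MB in 16-bit
```

Even at rank 64 the adapter is a small fraction of the multi-gigabyte base it rides on, which is what makes keeping many adapters resident practical.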

A minimal config

With peft + transformers, the LoRA setup is small:

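from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model ID: substitute the base model you are adapting
base_model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

config = LoraConfig(
    r=16,                          # rank of the low-rank update
    lora_alpha=32,                 # alpha = 2r is a common default
    target_modules="all-linear",   # not just attention: MLP too
    lora_dropout=0.05,             # light regularization on the adapter path
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)   # freezes the base, attaches adapters
model.print_trainable_parameters()           # sanity-check the %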

Set your learning rate well above what you'd use for full fine-tuning, run print_trainable_parameters() to confirm you're training a fraction of a percent, and start at r=16 before reaching for anything fancier.

The decision, in one table

| If you... | Use |
|---|---|
| Are instruction-tuning or domain-adapting on a modest dataset | LoRA |
| Are memory-bound (big model, single or consumer GPU) | QLoRA |
| Need low rank and want more out of it | DoRA |
| Have pretraining-scale data exceeding adapter capacity | Full fine-tuning |
| Will stack several sequential fine-tunes and need robustness | Full FT, or LoRA at higher rank |
| Want to serve many tasks/tenants from shared hardware | LoRA adapters + per-request swapping |
Start with LoRA at rank 16 across all linear layers, a learning rate ~10× your full-FT default, and only escalate when you hit a wall you can name.

References

  • LoRA — Hu et al., 2021 — https://arxiv.org/abs/2106.09685
  • QLoRA — Dettmers et al., 2023 — https://arxiv.org/abs/2305.14314
  • LoRA vs Full Fine-tuning: An Illusion of Equivalence — Shuttleworth et al., 2024 — https://arxiv.org/abs/2410.21228
  • LoRA Without Regret — Thinking Machines Lab, 2025 — https://thinkingmachines.ai/blog/lora/
  • LoRA+ — https://arxiv.org/abs/2402.12354
  • DoRA — https://nbasyl.github.io/DoRA-project-page/
  • Hugging Face PEFT — https://huggingface.co/docs/peft/en/index
  • HF 4-bit / QLoRA blog — https://huggingface.co/blog/4bit-transformers-bitsandbytes
  • vLLM LoRA adapters — https://docs.vllm.ai/en/latest/features/lora/
