Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than finetuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
This abstract introduces the core problem and solution of the paper in one clear narrative:
Let's break down each concept:
The abstract opens with this observation:
"An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains."
What this means:
This two-stage approach is powerful because the pretrained model has already learned general language patterns, so it only needs task-specific adjustments.
"As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible."
Full fine-tuning means updating every single parameter in the model. For GPT-3 with 175 billion parameters, this means:
Total: A single fine-tuned GPT-3 instance requires roughly 700+ GB of GPU memory. If you need separate models for 10 different tasks, you'd need 7+ TB total—economically prohibitive.
"We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture"
Let me unpack this mathematically and conceptually:
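The original model weights are not updated during training. If the original weight matrix is called $W_0$, then $W_0$ never changes. No gradients or optimizer states have to be stored for it, which removes most of the training memory cost.
This is the clever part. Instead of learning updates to $W_0$ directly, you learn two small matrices $U$ and $V$ such that their product approximates the weight change.
Mathematically, the forward pass computes:
$$h = W_0 x + U V^T x$$
Where $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight, $U \in \mathbb{R}^{d \times r}$ and $V \in \mathbb{R}^{k \times r}$ are the trainable factors, and the rank $r$ is much smaller than $d$ and $k$.
Why this works:
The term $U V^T$ is a matrix of rank at most $r$. Instead of learning $d \times k$ parameters (which could be millions), you only learn $r(d + k)$.
For a transformer layer with dimensions around 4,000 × 4,000 and $r = 8$, that is 16,000,000 parameters for the full update versus 64,000 for the two factors.
That's a 250× reduction for a single layer! GPT-3 has 96 transformer layers, so the same saving applies at every one of them.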
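The 250× figure can be checked with a few lines of arithmetic. This is only a sketch: `lora_param_counts` is a hypothetical helper, and the rank `r = 8` is an assumption chosen because it reproduces the stated ratio for a 4,000 × 4,000 layer.

```python
def lora_param_counts(d, k, r):
    """Return (full, low_rank, ratio) parameter counts for a d x k weight."""
    full = d * k              # every entry of the update is trainable
    low_rank = r * (d + k)    # entries of a d x r factor plus a k x r factor
    return full, low_rank, full / low_rank


# Assumed values: a 4,000 x 4,000 layer and rank 8.
full, low_rank, ratio = lora_param_counts(4000, 4000, 8)
print(f"full update:   {full:,} parameters")
print(f"low-rank form: {low_rank:,} parameters")
print(f"reduction:     {ratio:.0f}x")
```

Trying other ranks shows that the ratio scales inversely with `r`: halving the rank doubles the reduction.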
This comes from linear algebra. Any matrix $M$ can be decomposed as:
$$M = U \Sigma V^T$$
where $\Sigma$ contains singular values. By keeping only the largest $r$ singular values and discarding the rest, you get an approximation. LoRA exploits this idea: the weight changes needed for adaptation are assumed to be low-rank (i.e., most information is concentrated in a few principal directions).
"Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times"
What does this mean concretely?
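Stored in 32-bit precision, that's the difference between a roughly 700 GB fine-tuned model and a roughly 70 MB adapter, something you can easily download and load onto any machine. (The paper itself reports 350 GB versus 35 MB for its checkpoints.)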
"and the GPU memory requirement by 3 times"
With full fine-tuning, memory goes to:
With LoRA:
That's roughly a 4× reduction (the abstract says 3× conservatively, possibly accounting for other overheads).
"LoRA performs on-par or better than finetuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency"
This is the surprising and important result. Let me break down each claim:
"On-par or better model quality": Despite using 10,000× fewer trainable parameters, LoRA achieves comparable (or sometimes better) performance on downstream tasks. This suggests that task adaptation doesn't actually require changing all parameters—most information can be captured by low-rank updates.
"Higher training throughput": Training is faster because:
"No additional inference latency": Unlike some other parameter-efficient methods (like adapters, which insert extra layers that run on every forward pass), LoRA can be merged back into the original weights:
$$W_{\text{merged}} = W_0 + U V^T$$
The $U V^T$ computation is done once offline, not during each forward pass.
"We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA."
The hypothesis being investigated: When you fine-tune a pretrained model for a new task, the weight changes (the difference between fine-tuned and pretrained weights) are intrinsically low-rank.
In other words, if you computed:
$$\Delta W = W_{\text{fine-tuned}} - W_0$$
and performed a singular value decomposition:
$$\Delta W = U \Sigma V^T$$
you'd find that most of the "action" (variance explained) is concentrated in the first few singular values. The largest singular values carry most of the information, while the rest are near zero. This means you can accurately reconstruct most of $\Delta W$ using only the top $r$ components.
This isn't obvious a priori—it's an empirical finding that makes LoRA principled rather than just heuristic.
"We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2"
This means:
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Trainable Parameters | 175B | 17.5M (1/10,000th) |
| GPU Memory | ~1.2 TB | ~350 GB (about 1/3) |
| Inference Speed | Baseline | Same (after merge) |
| Model Quality | Baseline | On par or better |
| Training Speed | Baseline | Faster |
| Deployment Cost | Very High | Low |
The Problem is Real: Fine-tuning 175B-parameter models is economically infeasible for large-scale deployment
The Solution is Elegant: By exploiting low-rank structure in weight changes, you can achieve task adaptation with negligible parameters
Theory Meets Practice: The empirical finding that $\Delta W$ is low-rank explains why this works, elevating LoRA from a trick to a principled method
No Performance Tradeoff: Crucially, you don't sacrifice quality—LoRA matches or exceeds full fine-tuning
This makes LoRA a breakthrough for practical NLP: you get the benefits of task-specific adaptation with the efficiency of parameter-sharing.
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multipl...
This introduction section is setting up one of the most important challenges in modern machine learning: how do we efficiently adapt massive pre-trained models to specific tasks without the enormous computational and storage costs?
Think of it like this: imagine you've built a highly specialized Swiss Army knife that cost billions of dollars to manufacture. Now you need different versions for different jobs. Do you manufacture an entirely new Swiss Army knife for each task (full fine-tuning)? Or do you find a clever way to add small, task-specific attachments to the original knife while keeping the main tool unchanged (adaptation)?
The introduction explains the problem clearly and then introduces LoRA as an elegant solution. Let me break down the key ideas.
When we fine-tune a pre-trained language model, we update all of its parameters on a specific downstream task. This means:
Storage problem: If the original model has 175 billion parameters (like GPT-3), every fine-tuned version needs 175 billion parameters too. If you want to adapt the model for 100 different tasks, you need 100 copies of 175B parameters each—that's prohibitively expensive.
Computational problem: During training, we need to compute gradients for all parameters and maintain optimizer states (like momentum in Adam), which requires substantial GPU memory.
The paper quantifies the pain: GPT-3 has 175 billion parameters, making independent fine-tuned instances impractical.
The research community has tried to solve this by:
These approaches do reduce parameters, but they have critical drawbacks:
So there's a fundamental trade-off: efficiency vs. quality.
This is the crucial insight that motivated LoRA. Prior work (cited as Li et al. 2018a and Aghajanyan et al. 2020) observed something remarkable:
Although over-parametrized neural networks have many parameters, the actual information they represent lives in a much lower-dimensional space.
To understand this intuitively: imagine a 1000×1000 matrix (1 million numbers). Even though it could contain 1 million independent pieces of information, it might actually be well-represented by just 10 or 20 underlying "factors." The vast majority of the parameters are redundant.
The LoRA authors take this insight further with a hypothesis:
When we adapt a pre-trained model to a new task, the change in the model's weights also has a low intrinsic rank.
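Let's formalize this. Denote the pre-trained weight matrix by $W_0$ and the change it undergoes during adaptation by $\Delta W$.
The hypothesis is that $\Delta W$ can be well-approximated by a low-rank decomposition:
$$\Delta W \approx BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$.
Here, $d$ is the dimension of the Transformer layer (could be 768, 3072, 12288, etc.), and $r$ is typically just 1, 2, 4, or 8.
Imagine $\Delta W$ as a 12,288 × 12,288 matrix (as mentioned in the paper). This seems to need about 150 million parameters. But if $r \ll d$, then the two factors need only $2dr$ parameters, for example about 197,000 when $r = 8$.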
The low-rank structure captures the essential directions in which the weights need to change, without capturing every minor fluctuation.
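The paper shows the approach in Figure 1. Instead of training the full weight matrix $W_0$, LoRA freezes $W_0$ and trains only the two low-rank matrices $A$ and $B$.
The forward pass becomes:
$$h = W_0 x + BAx$$
where $x$ is the input, $h$ is the output, and $BA$ represents the learned weight update.
The paper gives a striking concrete example: for GPT-3 175B, a very low rank suffices even when the full rank $d$ is as high as 12,288.
Even though $d$ is massive, you only need rank 1 or 2! This is the empirical validation of the low-rank hypothesis.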
Let me break down the major benefits mentioned:
Instead of storing different fine-tuned models:
This is huge for production systems managing many models!
Unlike adapter layers or other parameter-efficient methods that add extra computation:
This is mathematically elegant: the low-rank update is linear, so you can collapse it into the weights before deployment.
Despite using fewer parameters and less computation, LoRA achieves comparable or better performance on multiple benchmarks (RoBERTa, DeBERTa, GPT-2, GPT-3).
LoRA can be combined with other techniques like prefix-tuning, making it broadly applicable.
The paper establishes notation that you'll see throughout:
| Notation | Meaning |
|---|---|
| $d_{model}$ | Input/output dimension of a Transformer layer (e.g., 768 or 3072) |
| $W_q$, $W_k$, $W_v$, $W_o$ | Query, Key, Value, Output projection matrices in self-attention |
| $W$ or $W_0$ | Pre-trained weight matrix (frozen in LoRA) |
| $\Delta W$ | Accumulated weight change during adaptation |
| $r$ | Rank of the LoRA decomposition (usually small: 1-8) |
| $d_{ffn}$ | Feedforward network dimension, typically $4 \times d_{model}$ |
The paper follows standard Transformer conventions from Vaswani et al. (2017) and uses the Adam optimizer for all experiments.
Recall from the abstract that this work achieves:
The introduction sets up exactly why this matters: modern models are too large to fine-tune independently for each task, and existing solutions either hurt performance or add runtime overhead. LoRA eliminates both problems by exploiting the low-rank structure of weight updates.
While our proposal is agnostic to training objective, we focus on language modeling as our motivating use case. Below is...
This section sets up the core problem that LoRA is trying to solve. The authors are establishing:
Think of it like this: imagine you have a massive pre-trained encyclopedia (GPT-3 with 175 billion parameters), and you want to specialize it for different tasks (summarization, Q&A, etc.). Traditional fine-tuning would create a completely new copy of that encyclopedia for each task. LoRA's insight is: we don't need to change everything—we can make just a few targeted adjustments.
The section starts with:
"Suppose we are given a pre-trained autoregressive language model $P_\Phi(y|x)$ parametrized by $\Phi$."
Let me break down this notation:
$P_\Phi(y|x)$: This is a conditional probability distribution. It represents the probability of generating output sequence $y$ given input sequence $x$. The subscript $\Phi$ tells us that this probability depends on the model's parameters.
$\Phi$: This is the set of all trainable parameters in the model. For GPT-3, $|\Phi| \approx 175$ billion parameters (the vertical bars mean "the size/count of").
Autoregressive: This means the model generates text one token at a time, where each new token depends on all previously generated tokens. This is how GPT models work.
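Each downstream task (like summarization) gets its own training dataset:
$$\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\ldots,N}$$
Breaking this down: each pair consists of a context $x_i$ and a target $y_i$, both of which are sequences of tokens, and $N$ is the number of training examples.
Example: For a summarization task, $x_i$ might be a long article, and $y_i$ might be its summary.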
$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log \left( P_\Phi(y_t|x, y_{<t}) \right) \tag{1}$$
This looks intimidating, but let's unpack it carefully:
The outer structure - "$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}}$":
The inner structure - "$\sum_{t=1}^{|y|} \log P_\Phi(y_t|x, y_{<t})$":
Intuition: We want to maximize the probability the model assigns to the actual target tokens. Taking the log turns multiplication into addition, which is mathematically convenient. In practice, we actually minimize the negative log probability (called cross-entropy loss), but that's equivalent.
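The log-to-sum step can be made concrete with a tiny numeric check. The per-token probabilities below are hypothetical values picked for illustration, not numbers from the paper.

```python
import math

# Hypothetical per-token probabilities P(y_t | x, y_<t) for a 3-token target.
token_probs = [0.5, 0.8, 0.1]

# Equation (1) sums the log-probabilities over token positions.
log_likelihood = sum(math.log(p) for p in token_probs)

# The same quantity is the log of the joint probability of the whole sequence.
joint_prob = math.prod(token_probs)

print(f"sum of logs    = {log_likelihood:.4f}")
print(f"log of product = {math.log(joint_prob):.4f}")
print(f"cross-entropy  = {-log_likelihood:.4f}")  # the loss minimized in practice
```

The two printed log values agree, which is why maximizing the sum of logs is the same as maximizing the probability of the full target sequence.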
Concrete example: Suppose we're translating English to French, and:
The equation sums over predicting each French word in sequence, using all previous words as context.
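After training via fine-tuning, the pre-trained weights $\Phi_0$ get updated to $\Phi_0 + \Delta\Phi$, where $\Delta\Phi$ is the task-specific update and $|\Delta\Phi| = |\Phi_0|$.
For GPT-3 with 175 billion parameters, that means every task needs its own 175-billion-parameter update.
Instead of directly learning all of $\Delta\Phi$, we encode it using a much smaller set of parameters $\Theta$:
$$\Delta\Phi = \Delta\Phi(\Theta)$$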
This is the reparameterization shown in Figure 1 from the introduction.
$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log \left( p_{\Phi_0+\Delta\Phi(\Theta)}(y_t|x, y_{<t}) \right) \tag{2}$$
What changed:
The constraint: $|\Theta| \ll |\Phi_0|$ (much, much less than)
The rest of the equation is identical—we still maximize the same log-probability objective, but now through a different parameterization.
The paper states:
"When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$."
What does this mean?
Compare the two approaches:
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Trainable parameters | 175 billion | 17.5 million |
| Reduction factor | baseline | 10,000× |
| Parameters per task | 175B | 17.5M |
| Storage for 100 tasks | 17.5 trillion parameters | 1.75 billion parameters (plus one shared 175B base model) |
This is the power of LoRA.
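The numbers in the comparison follow from simple arithmetic, sketched below. The variable names are illustrative, and the 0.01% fraction is the figure quoted from the paper.

```python
full_params = 175_000_000_000       # GPT-3 175B
lora_params = full_params // 10_000  # 0.01% of the full parameter count
num_tasks = 100

# Full fine-tuning stores a complete copy of the model per task.
full_total = full_params * num_tasks

# LoRA stores one small adapter per task and shares a single frozen base model.
lora_total = lora_params * num_tasks

print(f"trainable per task (LoRA): {lora_params:,}")
print(f"100 tasks, full fine-tune: {full_total:,} parameters")
print(f"100 tasks, LoRA adapters:  {lora_total:,} parameters (+ shared base)")
```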
The magic is in how $\Delta\Phi(\Theta)$ is defined. While the details come in later sections, the intuition is:
Hypothesis: The weight changes during adaptation don't actually require full rank. They live in a low-dimensional subspace.
Mathematically, instead of learning a full weight matrix update, LoRA learns two smaller matrices (A and B in Figure 1) whose product reconstructs the update:
$$\Delta W = BA$$
Where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r$ is the rank of the update.
The total parameters become $r(d + k)$ instead of $d \times k$, which is tiny when $r \ll \min(d, k)$.
For GPT-3 where $d = 12{,}288$, they find $r = 1$ or $r = 2$ works great!
The Problem: Full fine-tuning requires storing and updating $|\Phi_0|$ parameters per task.
The Constraint: We need $|\Theta| \ll |\Phi_0|$ (orders of magnitude smaller).
The Goal: Reformulate optimization from Equation (1) to Equation (2), where the smaller $\Theta$ parameterizes $\Delta\Phi$.
The Promise: We'll achieve this via low-rank decomposition of weight updates.
This section is essentially saying: "Instead of tweaking the 175 billion knobs on our model, can we find just a few 'master dials' that, when adjusted, effectively tune all those knobs?" The answer, remarkably, is yes—and that's what LoRA does.
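For a simplified binary case in which the probability of the correct outcome is $\sigma(z)$, the gradient of the log-likelihood with respect to the logit $z$ is:
$$\frac{d}{dz} \log \sigma(z) = 1 - \sigma(z)$$
where $\sigma$ is the sigmoid function. This gradient (which is itself a mirrored sigmoid) is used during backpropagation to update $\Theta$.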
Key properties:
Here's how the algorithm works:
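1. Initialize: Start with pre-trained weights $\Phi_0$ (frozen) and small random $\Theta$
2. Forward pass: For each training example $(x, y)$, compute the token probabilities under $\Phi_0 + \Delta\Phi(\Theta)$
3. Compute loss: Accumulate negative log-likelihood: $\mathcal{L} = -\sum_{t=1}^{|y|} \log p_{\Phi_0+\Delta\Phi(\Theta)}(y_t|x, y_{<t})$
4. Backward pass: Compute gradients $\nabla_\Theta \mathcal{L}$
5. Update: $\Theta \leftarrow \Theta - \eta \nabla_\Theta \mathcal{L}$ (gradient descent with learning rate $\eta$)
6. Repeat: Until convergence
The beauty is that only $\Theta$ gets updated (e.g., 17.5M parameters), while $\Phi_0$ stays frozen (175B parameters).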
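The loop above can be sketched on a deliberately tiny problem. Everything here is an assumption made for illustration: a single frozen scalar weight `w0`, a rank-1 update `b * a`, a sigmoid output standing in for the language model, and hand-derived gradients. It is not the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(w0, a, b, data):
    """Negative log-likelihood of the positive label for every input x."""
    return -sum(math.log(sigmoid((w0 + b * a) * x)) for x in data)


w0 = 0.1                  # frozen "pre-trained" weight, never updated
a, b = 0.5, 0.0           # trainable factors; b = 0 makes the initial update zero
data = [1.0, 2.0, 0.5]    # toy inputs, all labelled positive
lr = 0.1

initial_loss = nll(w0, a, b, data)
for step in range(200):
    grad_a = grad_b = 0.0
    for x in data:
        z = (w0 + b * a) * x
        dz = sigmoid(z) - 1.0      # d(-log sigmoid(z)) / dz
        grad_a += dz * b * x       # chain rule through z = (w0 + b*a) * x
        grad_b += dz * a * x
    a -= lr * grad_a               # only the small factors are updated
    b -= lr * grad_b
final_loss = nll(w0, a, b, data)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}, w0 = {w0}")
```

Because `b` starts at zero, the first step changes only `b`; the gradient for `a` becomes non-zero once `b` has moved, while `w0` is untouched throughout.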
| Aspect | Insight |
|---|---|
| Objective | Maximize conditional probability of target tokens given context — standard language modeling |
| Innovation | Encode $\Delta\Phi$ in low-rank format parameterized by $\Theta$ |
| Log-transform | Converts product of probabilities to sum, improving numerical stability |
| Double sum | Outer sum over all training pairs; inner sum over token positions in each target |
| Gradient flow | Sigmoid-shaped gradients (0 to 1) ensure stable training |
| Efficiency | 10,000× reduction in trainable parameters for GPT-3 (175B→17.5M) |
| Practical impact | Can store 10,000 task-specific LoRA adapters for same memory as one full model |
This elegant formulation makes adapting giant language models feasible for everyone, not just organizations with massive compute budgets!
[Figure not shown: how log converts multiplication into addition]
[Figure not shown: the log-probability objective, showing why higher probabilities yield larger log values]
[Figure not shown: log-likelihood for a 3-token sequence with given per-token probabilities]
[Figure not shown: the joint probability corresponding to the log-likelihood]
[Figure not shown: the scale of trainable parameters in LoRA vs full fine-tuning for GPT-3]
[Figure not shown: the gradient of log-likelihood (log-sigmoid) used during backpropagation]
The derivative of the log term with respect to the probability $p$ is:
$$\frac{d}{dp} \log p = \frac{1}{p}$$
This is crucial: when the probability $p$ assigned to the correct token is small, the derivative $1/p$ is large, so the gradient is large and we make big updates. When $p$ is close to 1, the derivative is close to 1, so gradient updates are gentler. This automatic weighting, where wrong predictions get stronger gradients, is a beautiful property of the log-likelihood objective!
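A finite-difference check makes the size of that derivative tangible. The probabilities 0.1, 0.5 and 0.9 are assumed sample values, and `d_log` is a hypothetical helper.

```python
import math


def d_log(p, eps=1e-6):
    """Central finite-difference estimate of d/dp log(p)."""
    return (math.log(p + eps) - math.log(p - eps)) / (2 * eps)


for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: numeric derivative = {d_log(p):.4f}, 1/p = {1 / p:.4f}")
```

A poorly predicted token (small `p`) contributes a derivative roughly nine times larger than a well predicted one in this example.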
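The paper then introduces Equation (2), which is the key contribution of LoRA:
$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log \left( p_{\Phi_0+\Delta\Phi(\Theta)}(y_t|x, y_{<t}) \right)$$
The critical difference:
LoRA's Innovation: Rather than learning the full matrix $\Delta W$ (which would be huge), LoRA decomposes it as a low-rank product:
$$\Delta W = BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$.
For GPT-3 with dimensions $d = k = 12{,}288$ and a small rank $r$, the update needs $2 \times 12{,}288 \times r$ parameters instead of roughly 151 million.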
Equation (1) is the foundational training objective:
This is the conceptual foundation that motivates everything that follows in the LoRA paper.
The problem we set out to tackle is by no means new. Since the inception of transfer learning, dozens of works have soug...
Before proposing LoRA, the paper needs to convince you that existing methods for efficient model adaptation are inadequate. This section accomplishes that by examining two major approaches that researchers had already tried, then systematically explaining why each one falls short in practical, large-scale scenarios.
Think of this as the paper saying: "We can't just use what's already out there—here's why."
Recall from Section 2 that we want to find a small set of parameters $\Theta$ such that:
$$\Delta\Phi = \Delta\Phi(\Theta)$$
where $|\Theta| \ll |\Phi_0|$ (the trainable parameters are much smaller than the original model).
By 2021, when this paper was written, the NLP community had already explored two main strategies for this problem:
The authors now explain why both approaches have critical flaws for large-scale, production-ready systems.
Adapter layers are small neural network modules inserted between (or within) the layers of a Transformer. The idea is simple: instead of updating all parameters, you add a few extra layers with much fewer parameters.
The two main variants mentioned:
Here's the subtle but critical problem:
The math perspective: Consider a Transformer block's forward pass. Normally, a block takes input $x$ and produces output $y$ using a few large matrix operations that parallelize well. With adapters, extra bottleneck layers are inserted into that computation, either after each sub-layer or once per block.
The adapter operations must execute sequentially with the layers they are attached to. There's no way to parallelize them with the rest of the block.
Why this matters in practice:
Concrete example from the paper: [Table 1] shows actual latency measurements on GPT-2 medium (a 355M parameter model):
The table demonstrates that even with "small" adapters, inference latency increases noticeably. This might seem like a minor increase, but at large scale (e.g., serving millions of requests), this compounds into significant cost.
When you split a model across multiple GPUs (model sharding—see Shoeybi et al. 2020), the adapter problem amplifies:
This is a major issue for deploying very large models like GPT-3 (175B parameters), which cannot fit on a single GPU.
Prefix tuning (Li & Liang, 2021) takes a different approach: instead of modifying model weights, you prepend learnable "prompt tokens" to the input sequence. The model then adapts by learning what these prompt tokens should be.
Mathematically, if the original input sequence is $x = (x_1, \ldots, x_n)$, prefix tuning creates:
$$x' = (p_1, \ldots, p_m, x_1, \ldots, x_n)$$
where $p_1, \ldots, p_m$ are learnable parameters and $m$ is relatively small.
Problem 1: Non-monotonic performance
As you increase the number of learnable prompt tokens ($m$), the model's performance doesn't improve smoothly. Instead, performance fluctuates unpredictably. The paper states:
"prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters"
What does "non-monotonic" mean here?
In mathematics, a function is monotonic if it only increases or only decreases. A non-monotonic function increases in some places and decreases in others—it oscillates.
This is a red flag because:
Problem 2: Reduces available sequence length
Transformers process sequences of text tokens. If you have a budget of $L$ total tokens, and you use $m$ tokens for the learned prompt, you can only use $L - m$ tokens for the actual downstream task.
Why this hurts:
Intuition: Imagine you have a fixed amount of working memory. If you "reserve" part of it for learning how to do a task, you have less working memory left to actually solve it.
By eliminating these two approaches, the paper has cleared the way to motivate LoRA:
| Aspect | Adapters | Prefix Tuning | LoRA (coming) |
|---|---|---|---|
| Inference latency | Added latency | No added latency | No added latency |
| Sequence length | Not affected | Reduces available length | Not affected |
| Optimization difficulty | Straightforward | Non-monotonic, hard | (To be shown) |
| Parameter efficiency | Good | Good | (To be shown) |
The paper is setting up the case that LoRA can be the best of both worlds: efficient like adapters and prompt tuning, but without their critical drawbacks.
Adapter layers add inference latency because they must execute sequentially and don't leverage hardware parallelism, especially problematic when batch size = 1 (typical in production)
Prompt optimization reduces usable sequence length (you must reserve tokens for the learned prompt) and has a rough optimization landscape (non-monotonic performance)
Both existing approaches require a trade-off between efficiency and quality, which the authors will argue LoRA avoids
The problem is especially acute for large-scale, production scenarios where latency and throughput matter enormously
This section doesn't just say "other methods are bad"—it explains specifically why they fail at scale, setting up the motivation for LoRA's design choices (which you'll see in the next section).
We describe the simple design of LoRA and its practical benefits. The principles outlined here apply to any dense layers...
Before diving into the math, let's understand what this section is accomplishing and why it matters:
The Core Problem: When we fine-tune a large pre-trained language model (like GPT-3 with 175 billion parameters), we need to update every single parameter. This is expensive in terms of storage, computation, and memory. The key insight here is that we don't actually need to update all parameters independently—the updates themselves might have a special structure.
The Core Insight: The authors hypothesize that when adapting a model to a new task, the changes to the weight matrices (called $\Delta W$) don't need the full complexity of the original matrices. Instead, they can be expressed as the product of two smaller matrices, a "low-rank decomposition."
Why This Matters: If we can represent $\Delta W$ as a product of two small matrices instead of storing the full matrix, we reduce trainable parameters by orders of magnitude while maintaining performance.
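The traditional fine-tuning approach updates a weight matrix as:
$$W = W_0 + \Delta W$$
where $W_0 \in \mathbb{R}^{d \times k}$ is the pre-trained weight matrix and $\Delta W \in \mathbb{R}^{d \times k}$ is the update learned during fine-tuning.
LoRA's Key Idea: Instead of learning all of $\Delta W$ directly, we decompose it as:
$$W_0 + \Delta W = W_0 + BA$$
where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.
The notation $\ll$ means "much smaller than."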
Let's think about what's happening with dimensions:
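Example: Suppose we have a weight matrix with $d = 12{,}288$ and $k = 12{,}288$ (a 12K × 12K matrix, common in GPT-3): the full update has about 151 million entries, while $B$ and $A$ together have $2 \times 12{,}288 \times r$, or about 98,000 when $r = 4$.
The reason this decomposition works is grounded in a property called rank: the matrix $\Delta W$ doesn't need to be "full-rank" (maximally complex). Instead, it can be well-approximated by a low-rank matrix.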
During training and inference, instead of the original forward pass:
$$h = W_0 x$$
we compute:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
Let's break this down step-by-step:
Key Insight: Mathematically, matrix multiplication is associative, so $(BA)x = B(Ax)$. We can compute this efficiently by first doing the cheaper operation ($Ax$, with a small matrix), then the expansion ($B(Ax)$).
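The associativity argument can be verified on a small example. The shapes (`d = 3`, `k = 4`, `r = 1`) and the matrix entries are hypothetical, and the helpers use plain Python lists so the sketch has no dependencies.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


d, k, r = 3, 4, 1
B = [[1.0], [2.0], [-1.0]]         # d x r
A = [[0.5, 0.0, 1.0, 2.0]]         # r x k
x = [1.0, 2.0, 3.0, 4.0]           # length k

two_step = matvec(B, matvec(A, x))   # B(Ax): project down to r, then expand to d
one_step = matvec(matmul(B, A), x)   # (BA)x: build the full d x k matrix first

# Scalar multiplications needed by each ordering.
mults_two_step = r * k + d * r
mults_one_step = d * r * k + d * k

print(two_step, one_step)
print(f"B(Ax): {mults_two_step} multiplications, (BA)x: {mults_one_step}")
```

Both orderings give the same vector, but the two-step form never materializes the full `d x k` matrix, and the gap widens quickly as `d` and `k` grow.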
The authors use a specific initialization strategy: a random Gaussian initialization for $A$ and zeros for $B$.
This means at the beginning of training:
$$\Delta W = BA = 0$$
(the zero matrix, since $B = 0$)
Why This Matters: The model starts as the pure pre-trained model with zero adaptation. This is important for stability—the fine-tuning signal builds up gradually from this solid foundation.
After computing $\Delta W x$, the output is scaled by a factor:
$$\frac{\alpha}{r}$$
where $\alpha$ is a constant (the authors keep it fixed and don't tune it).
Why This Scaling Matters:
The scaling factor compensates for the effect of changing the rank . Here's the intuition:
In the authors' words: "When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately." They simply set $\alpha$ to the first value of $r$ they try and keep it fixed.
Here's a profound observation: LoRA subsumes full fine-tuning as a special case.
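If we set the rank $r$ equal to the rank of the original weight matrix $W_0$, then the decomposition $BA$ can represent essentially any update $\Delta W$. In other words, as $r$ grows, training LoRA roughly converges to training the original model.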
This means:
This is a critical practical advantage. At deployment time, we can combine $W_0$ and $BA$ into a single matrix:
$$W = W_0 + BA$$
Then we can:
Contrast with Adapters: Other methods add extra layers that must be computed sequentially, introducing latency. LoRA merges the adaptation directly into the weight matrices, avoiding this overhead entirely.
During training, we only maintain gradients and optimizer states for the small matrices $A$ and $B$, not for the large frozen matrix $W_0$. This is important because:
The section references work by Aghajanyan et al. (2020), which discovered that pre-trained language models have low intrinsic dimension—they can be expressed efficiently in a lower-dimensional subspace.
The authors extend this insight: if the pre-trained model lives in a low-dimensional subspace, then the changes to that model should also be low-dimensional. This is the theoretical motivation for the low-rank assumption.
Think of it this way: A pre-trained model is like a high-resolution image. The model has already learned most of what it needs to know about language. Adapting to a new task is like making small, targeted adjustments to that image. These adjustments don't require full complexity—they can be expressed as a low-rank perturbation.
Let me give you a concrete sense of the savings:
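| Component | Dimensions | Parameters |
|---|---|---|
| $W_0$ (frozen) | $d \times k$ | $dk$ (not counted) |
| $B$ (trainable) | $d \times r$ | $dr$ |
| $A$ (trainable) | $r \times k$ | $rk$ |
| Total trainable | n/a | $r(d + k)$ |
| Savings | n/a | factor of $\frac{dk}{r(d + k)}$ |
For GPT-3 (where $d = 12{,}288$ and $r$ is typically between 1 and 8), the reduction is dramatic.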
This section establishes:
Everything that follows in the paper—experiments, comparisons with other methods, applications—builds on this foundation.
The equation
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
describes the forward pass of a neural network layer with LoRA (Low-Rank Adaptation) applied. Let me break down each component:
When you compute $W_0 x$ and $BAx$ separately and add them coordinate-wise:
The key point is that both terms produce vectors of the same shape ($\mathbb{R}^d$), which then combine element-wise. This is precisely what happens in a neural network: you apply the frozen weights, apply the trainable adaptation, and sum the results.
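The brilliance of LoRA lies in how you parameterize the update $\Delta W$. Instead of storing a full weight update matrix, you store two much smaller matrices: $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.
With a 100×100 weight matrix:
Full fine-tuning (update entire weight matrix): $100 \times 100 = 10{,}000$ trainable parameters.
LoRA with a small rank $r$ (typical value): $2 \times 100 \times r$ trainable parameters, for example 800 when $r = 4$.
LoRA with rank $r = 100$ (full rank, recovers expressiveness): $2 \times 100 \times 100 = 20{,}000$ trainable parameters, which is more than full fine-tuning.
The paper's hypothesis rests on an empirical observation: weight updates during fine-tuning have low intrinsic rank. This means you don't need to modify all parameters independently; most of the variation can be captured in a much lower-dimensional subspace (dimension $r$).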
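Consider this extreme example: a rank-1 matrix:
$$M = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{pmatrix}$$
This 3×3 matrix has rank 1: every row is just a multiple of the first row. You could represent this entire matrix with just 3 + 3 = 6 parameters as the outer product of two vectors:
$$M = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \end{pmatrix}$$
That's the core principle: if $\Delta W$ has low rank, you save enormous amounts of memory and computation.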
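The outer-product idea can be reproduced in a few lines. The vectors `u` and `v` below are chosen purely for illustration; any pair of non-zero vectors would produce a rank-1 matrix.

```python
u = [1, 2, 3]   # column vector (assumed example values)
v = [1, 2, 3]   # row vector (assumed example values)

# Outer product: M[i][j] = u[i] * v[j]
M = [[ui * vj for vj in v] for ui in u]

# Rank 1 means every row is proportional to the first row:
# M[i][j] * M[0][0] == M[i][0] * M[0][j] for all i, j.
is_rank_one = all(
    M[i][j] * M[0][0] == M[i][0] * M[0][j]
    for i in range(3)
    for j in range(3)
)

stored_full = len(M) * len(M[0])     # 9 numbers for the dense matrix
stored_factored = len(u) + len(v)    # 6 numbers for the two vectors

for row in M:
    print(row)
print(is_rank_one, stored_full, stored_factored)
```

For a 3×3 matrix the saving is modest, but the same count (`d + k` versus `d * k`) is what makes the factored form so compact for large layers.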
From the paper:
We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training.
Why? This ensures the model starts exactly as the original pre-trained model (since $\Delta W = 0$), then gradually learns the task adaptation.
The scaling factor $\frac{\alpha}{r}$ shrinks the update as the rank grows:
| Rank $r$ | Scale $\alpha / r$ | Effect |
|---|---|---|
| 1 | $\alpha$ | Large updates |
| 2 | $\alpha / 2$ | Medium updates |
| 4 | $\alpha / 4$ | Smaller updates |
| 8 | $\alpha / 8$ | Even smaller |
This inverse relationship keeps the scale of the update roughly comparable as you increase rank. The paper notes that this scaling reduces the need to retune hyperparameters when you vary $r$: you can set $\alpha$ once and vary $r$ without retuning the learning rate.
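The scale column can be tabulated directly. The value `alpha = 8` is an assumption for this sketch and is not a setting taken from the paper.

```python
alpha = 8  # assumed constant, fixed once and not tuned

# Scale applied to the low-rank update for each rank.
scales = {r: alpha / r for r in (1, 2, 4, 8)}

for r, scale in scales.items():
    print(f"r = {r}: alpha / r = {scale}")
```

With `alpha` fixed, doubling the rank halves the multiplier on the update.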
A critical practical advantage: At deployment, you can merge the weights by explicitly computing:
$$W = W_0 + BA$$
Then use $W$ for inference exactly like a standard neural network, with zero additional latency. When switching tasks, you subtract $BA$ and add $B'A'$, a quick operation.
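Merging and task switching can be sketched with small hand-picked matrices. All values and helper names below are hypothetical, and the scaling factor is left out for brevity.

```python
def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def sub(P, Q):
    return [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


W0 = [[1.0, 0.0], [0.0, 1.0]]                  # frozen pre-trained weight
B1, A1 = [[1.0], [2.0]], [[0.5, -1.0]]         # adapter for task 1
B2, A2 = [[0.0], [3.0]], [[1.0, 1.0]]          # adapter for task 2
x = [2.0, 4.0]

# Merge once offline, then run inference with a single matrix.
W = add(W0, matmul(B1, A1))
merged_out = matvec(W, x)

# Unmerged path: frozen weight plus low-rank branch, computed separately.
base = matvec(W0, x)
low_rank = matvec(B1, matvec(A1, x))
separate_out = [h + dh for h, dh in zip(base, low_rank)]

# Switch tasks: subtract the old update and add the new one.
W = add(sub(W, matmul(B1, A1)), matmul(B2, A2))

print(merged_out, separate_out)
print(W)
```

The values here are exactly representable, so the results match bit for bit; with real weights the merged and unmerged outputs agree only up to floating-point rounding.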
This contrasts with adapter-based methods (which add MLP layers) or prefix-based methods (which increase sequence length), both of which incur runtime overhead.
The equation achieves three things simultaneously:
This is why LoRA became so influential in efficient fine-tuning of large language models.
[Figure not shown: rank-deficient matrices have intrinsic low-rank structure]
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable p...
In Section 4.1, we learned the general principle of LoRA: instead of updating all weights in a neural network, we represent weight updates as a low-rank product $BA$. Now we need to ask: where exactly do we apply this technique in a real Transformer model?
This section answers that practical question. It explains:
Before diving into LoRA's application, let's establish what weight matrices exist in a Transformer:
A Transformer block contains two main modules:
Each weight matrix performs a linear transformation. For example, the query projection transforms the input using matrix multiplication.
Here's what the paper does:
"We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules both for simplicity and parameter-efficiency."
Let's unpack this:
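Attention Matrices: The paper considers $W_q$, $W_k$, $W_v$, and $W_o$ as candidates for LoRA, and most of its experiments adapt only $W_q$ and $W_v$. These matrices are each of dimension:
$$d_{\text{model}} \times d_{\text{model}}$$
where $d_{\text{model}}$ is the hidden dimension size (for GPT-3, this is 12,288).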
MLP Modules: The paper does not apply LoRA here—these weights stay frozen. This is a practical choice to reduce the number of trainable parameters even further.
The paper notes that the effect of adapting different attention weight matrices is studied later (Section 7.1), suggesting this is an empirical design decision. The key insight is:
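Let's take $W_q$ as a concrete example. Originally:
$$q = W_q x$$
where:
- $x$ is the input to the attention module
- $W_q \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ is the pretrained query projection
With LoRA applied, we modify this to:
$$q = W_q x + BAx$$
where:
- $B \in \mathbb{R}^{d_{\text{model}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{model}}}$ are the trainable low-rank factors
- $r \ll d_{\text{model}}$ is the rank
Key point: The term $BAx$ is the trainable low-rank update, while $W_q$ is frozen.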
The paper provides concrete calculations. Let's work through the key insight:
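When training a large model with the Adam optimizer, we must store the weights, their gradients, and Adam's two moment estimates for every trainable parameter.
For full fine-tuning of GPT-3 175B: about 1.2 TB of VRAM.
For LoRA fine-tuning with only query and value projection adapted: about 350 GB, because optimizer states are kept only for $A$ and $B$.
Result: roughly $3\times$ reduction (the "3 times" mentioned in the abstract)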
For deploying fine-tuned models on different tasks:
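Full fine-tuning approach: every task needs its own full copy of the model, about 350 GB per checkpoint.
LoRA approach: one shared copy of the frozen model, plus a task-specific checkpoint of about 35 MB that holds only $A$ and $B$ (with $r = 4$ on the query and value projections).
Result: 10,000× reduction in additional storage per task
This comes from:
$$|\Theta_{\text{LoRA}}| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r = 2 \times 192 \times 12{,}288 \times 4 \approx 18.9\text{M}$$
compared to:
$$|\Theta_{\text{full}}| \approx 175\text{B}$$
Ratio:
$$\frac{350\ \text{GB}}{35\ \text{MB}} \approx 10^4$$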
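The checkpoint arithmetic can be checked with a short script. It assumes the GPT-3 175B shape used in this section (96 layers, $d_{\text{model}} = 12{,}288$), LoRA with $r = 4$ on $W_q$ and $W_v$ only, and 2 bytes per parameter (FP16). The function name is hypothetical.

```python
def lora_param_count(num_adapted_matrices: int, d_model: int, r: int) -> int:
    """2 * L_hat * d_model * r: each adapted matrix gets B (d x r) and A (r x d)."""
    return 2 * num_adapted_matrices * d_model * r

n_layers, d_model, r = 96, 12288, 4
num_adapted = n_layers * 2              # W_q and W_v in every layer
lora_params = lora_param_count(num_adapted, d_model, r)

bytes_per_param = 2                     # FP16 (assumption)
lora_megabytes = lora_params * bytes_per_param / 1e6
full_gigabytes = 175e9 * bytes_per_param / 1e9
ratio = 175e9 / lora_params

print(lora_params)            # about 18.9 million
print(round(lora_megabytes))  # a few tens of MB
print(round(full_gigabytes))  # 350 GB
print(round(ratio))           # on the order of 10^4
```

The exact ratio comes out slightly under 10,000, which is why the paper describes the reduction as "roughly" 10,000×.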
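### Training Speed

The paper reports: **25% speedup on GPT-3 175B training**

Why? Because we don't compute gradients for the vast majority of parameters. Only $A$ and $B$ receive gradients in the adapted forward pass:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
with the scaling factor $\frac{\alpha}{r}$ applied to $BAx$.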
Section 4.2 now tells us:
This is the bridge between the general principle and the actual implementation, which is crucial for practitioners trying to use LoRA.
| Metric | Full Fine-tuning | LoRA | Improvement |
|---|---|---|---|
| Trainable Parameters | 175B | ~0.018B (0.01%) | 10,000× fewer |
| Training VRAM | 1.2 TB | 350 GB | 3× less |
| Task Checkpoint Size | 350 GB | 35 MB | 10,000× smaller |
| Training Speed | Baseline | +25% | 25% faster |
| Inference Latency | None | None (when merged) | Same |
| Batch Multiple Tasks | Yes | Limited | Depends on approach |
Section 4.2 demonstrates that LoRA's theoretical elegance (low-rank updates) translates into real, measurable engineering benefits. The main trade-off is the loss of flexibility in batching multiple tasks, but this is a minor limitation compared to the dramatic efficiency gains. For most production scenarios where models are deployed for a fixed set of tasks, the benefits far outweigh the costs.
We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 ...
Before the researchers can claim that LoRA actually works, they need to show the evidence. This section lays out their experimental plan: which models they tested, which tasks they used, and which competing methods they compared against.
Think of this like a science experiment—the authors are saying: "Here's what we're testing, here's what we're comparing it to, and here's how we'll measure success." This transparency is crucial for readers to evaluate whether the claims made in the abstract (that LoRA matches or beats full fine-tuning while using far fewer parameters) actually hold up.
The authors test LoRA across a range of models and tasks:
This progression is strategic: they show LoRA works on smaller models first, then scale up to the enormous GPT-3 model where the parameter savings matter most.
NLU (Natural Language Understanding):
NLG (Natural Language Generation):
This mix of tasks is important because it shows LoRA works across different types of language problems—not just understanding, but also generation.
This is the critical section for understanding what makes LoRA special. The authors compare LoRA to several existing approaches. Let me break down each competing method and explain the parameter count formula for each.
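Here we encounter our first parameter count formula. Let me break it down:
$$|\Theta| = d_{\text{model}} \times (l_p + l_i)$$
What each term means:
- $d_{\text{model}}$: the hidden (embedding) dimension
- $l_p$: the number of prefix tokens
- $l_i$: the number of infix tokens
What this is doing geometrically: Imagine you have an input sequence. The method inserts special tokens with learnable word embeddings (prefix tokens before the prompt, infix tokens after it). Each token has $d_{\text{model}}$ dimensions. If you add $l_p + l_i$ total tokens, you get $d_{\text{model}} \times (l_p + l_i)$ parameters to train.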
The tradeoff:
What changed:
Geometric intuition: If PreEmbed only learns prefixes at the input level, PreLayer learns them throughout the entire network depth. This is more flexible but more expensive.
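This is more complex. There are several variants (AdapterH, AdapterL, AdapterP, and AdapterD), but they all use this parameter formula:
$$|\Theta| = \hat{L}_{\text{Adpt}} \times (2 \times d_{\text{model}} \times r + r + d_{\text{model}}) + 2 \times \hat{L}_{\text{LN}} \times d_{\text{model}}$$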
This is intimidating, so let me decompose it:
Breaking down the first term:
Recall from Section 4.1 that adapters insert two small neural networks (MLP layers) into each Transformer block. Within each adapter:
So each adapter block has: $2 \times d_{\text{model}} \times r + r + d_{\text{model}}$ parameters.
Multiply by $\hat{L}_{\text{Adpt}}$ (number of blocks with adapters) and you get the first term.
Breaking down the second term:
Some adapter variants (specifically AdapterH and AdapterL) add extra LayerNorm operations:
Key insight: Adapters add training parameters in each layer, and critically, they introduce sequential computation during inference (from Section 3), which increases latency.
What each term means:
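Why this formula is simpler: Recall equation (3) from Section 4.1:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
For a single weight matrix $W_0 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, the factors $B$ and $A$ add:
$$d_{\text{model}} \times r + r \times d_{\text{model}} = 2 \times d_{\text{model}} \times r$$
When we apply this to attention (which treats all head projections as a single matrix), and we do it for both query and value projections:
$$|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r$$
where $\hat{L}_{\text{LoRA}}$ is the number of weight matrices LoRA is applied to.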
This is vastly smaller than the adapter formula because there's no expansion/compression overhead and no LayerNorm parameters.
Here's how these compare intuitively when $r$ is small (e.g., $r = 4$ or $r = 8$):
| Method | Parameter Growth | Key Issue |
|---|---|---|
| Full Fine-Tuning | Billions | Expensive to deploy |
| BitFit | Tiny | Too limited in expressiveness |
| PreEmbed | Small | Reduces usable sequence length |
| PreLayer | Medium (grows with $L$) | Reduces usable sequence length |
| Adapter | Medium (grows with $r$) | Adds inference latency |
| LoRA | Small (grows with $r$) | No inference latency; competitive performance |
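The four parameter-count formulas from this section can be written as plain functions for side-by-side comparison. The function names are hypothetical. The example sizes (24 layers, $d_{\text{model}} = 1024$, which matches a RoBERTa-large-sized model) and the prefix, infix, and bottleneck sizes are assumptions chosen only to show orders of magnitude.

```python
def pre_embed_params(d_model: int, l_p: int, l_i: int) -> int:
    """Prefix-embedding tuning: d_model * (l_p + l_i)."""
    return d_model * (l_p + l_i)

def pre_layer_params(n_layers: int, d_model: int, l_p: int, l_i: int) -> int:
    """Prefix-layer tuning: L * d_model * (l_p + l_i)."""
    return n_layers * d_model * (l_p + l_i)

def adapter_params(n_adapters: int, n_layernorms: int, d_model: int, r: int) -> int:
    """Adapters: L_adpt * (2*d_model*r + r + d_model) + 2 * L_ln * d_model."""
    return n_adapters * (2 * d_model * r + r + d_model) + 2 * n_layernorms * d_model

def lora_params(n_matrices: int, d_model: int, r: int) -> int:
    """LoRA: 2 * L_lora * d_model * r."""
    return 2 * n_matrices * d_model * r

n_layers, d_model = 24, 1024  # assumed model shape
print("PreEmbed:", pre_embed_params(d_model, l_p=8, l_i=8))
print("PreLayer:", pre_layer_params(n_layers, d_model, l_p=8, l_i=8))
print("Adapter: ", adapter_params(2 * n_layers, 0, d_model, r=16))
print("LoRA:    ", lora_params(2 * n_layers, d_model, r=8))  # W_q and W_v per layer
```

Only the adapter formula carries the extra bias and LayerNorm terms; the LoRA count depends on nothing but the number of adapted matrices, the width, and the rank.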
The authors are being scientifically rigorous. By showing performance against multiple baselines across multiple models and tasks, they're:
This experimental design—testing across different model sizes, different task types, and different baseline methods—is what allows them to make the broad claims in the abstract.
RoBERTa (Liu et al., 2019) optimized the pre-training recipe originally proposed in BERT and boosted the latter's task p...
This section presents the experimental validation of LoRA on real-world models and benchmarks. After explaining the theoretical framework in earlier sections, the authors now answer the crucial question: Does LoRA actually work in practice? Specifically, they test whether LoRA can match the performance of full fine-tuning while using dramatically fewer trainable parameters. This is important because it demonstrates that their low-rank hypothesis isn't just theoretically sound—it delivers practical value.
The choice of models (RoBERTa and DeBERTa) is strategic: these are increasingly sophisticated variants of BERT, so testing on them shows LoRA scales to better models. The benchmark is GLUE (General Language Understanding Evaluation), which is the standard test suite in NLP for evaluating language understanding.
RoBERTa is an improved version of BERT that uses better pre-training techniques. The paper uses two sizes: RoBERTa base (125M parameters) and RoBERTa large (355M parameters).
Think of these as "small" and "medium" language models by modern standards.
GLUE is a benchmark consisting of 9 different NLP tasks. Each task measures a different aspect of language understanding:
| Task | Type | Measurement |
|---|---|---|
| MNLI | Natural language inference | Overall accuracy |
| CoLA | Grammatical acceptability | Matthews correlation |
| STS-B | Semantic textual similarity | Pearson correlation |
| Other tasks | Various | Accuracy |
Key point: A higher score is better for all metrics. The paper reports different metrics for different tasks because some tasks have natural metrics (like correlation for similarity tasks vs. accuracy for classification tasks).
The authors keep certain experimental conditions fixed to ensure fair comparison:
These controls are important because they ensure that differences in performance come from the adaptation method (LoRA vs. alternatives), not from different training hyperparameters.
DeBERTa is a more recent and more powerful variant than BERT or RoBERTa. The "XXL" version they test has about 1.5 billion parameters.
This is important for testing LoRA because larger models present a greater challenge: more parameters mean more potential for "full-rank" updates during fine-tuning (in other words, updates that use the full capacity of the weight matrices). If LoRA works well on DeBERTa XXL despite its massive size, it suggests the low-rank hypothesis is fundamentally sound.
The paper states:
"LoRA with only 4.7M trainable parameters matches or exceeds the full fine-tuning baseline across all tasks."
Let's unpack what this means mathematically.
This means LoRA uses only about 0.31% of the parameters that full fine-tuning requires. Yet it matches or exceeds full fine-tuning's performance.
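Recall from Section 5 (the Overview section), the number of trainable parameters for LoRA is:
$$|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r$$
For DeBERTa XXL, working backwards:
$$r = \frac{|\Theta|}{2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}}}, \qquad |\Theta| \approx 4.7\text{M}$$
Solving for $r$ gives a small value, confirming they use a very small rank.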
---

## Part 3: Comparing to Other Baselines

Looking back at Section 5's parameter count comparison, recall the formulas:

| Method | Trainable Parameters |
|--------|----------------------|
| LoRA | $2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r$ |
| Adapter | $\hat{L}_{\text{Adpt}} \times (2 \times d_{\text{model}} \times r + r + d_{\text{model}}) + 2 \times \hat{L}_{\text{LN}} \times d_{\text{model}}$ |
| Prefix-layer | $L \times d_{\text{model}} \times (l_p + l_i)$ |

**LoRA's formula is much simpler and more efficient** because:

1. It only adds parameters proportional to the rank $r$ and model width
2. It doesn't require extra layers (like Adapter) or special architectural modifications (like Prefix-layer)

The key insight: **simplicity in structure leads to efficiency in parameters**.

---

## Part 4: Why These Results Matter

### The Empirical Validation of the Low-Rank Hypothesis

Recall from Section 4.1, the core assumption is:

> "The updates to the weights also have a low 'intrinsic rank' during adaptation"

In mathematical terms: when adapting a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ to a new task, the change $\Delta W$ doesn't need all dimensions. The paper hypothesizes that $\Delta W$ can be well-approximated by:
$$\Delta W = BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$.
These experimental results validate this hypothesis: if the updates truly required full rank, then using rank-$r$ approximations would significantly hurt performance. Instead, performance matches or exceeds full fine-tuning.
When the paper says LoRA "matches or exceeds" full fine-tuning, look at Table 2:
This rules out the alternative explanation that "LoRA is a useful approximation that sacrifices a little accuracy for parameter efficiency." Instead, it shows LoRA is fundamentally as effective as full fine-tuning while using far fewer parameters.
Recall from Section 4.2:
These improvements follow from the parameter reduction. If you have fewer trainable parameters:
| Model | Size | Method | Trainable Params | Performance |
|---|---|---|---|---|
| RoBERTa base | 125M | LoRA | 0.3M | Matches FT |
| RoBERTa large | 355M | LoRA | 0.8M | Matches FT |
| DeBERTa XXL | 1.5B | LoRA | 4.7M | Matches or exceeds FT |
The pattern is clear: as models get larger, LoRA becomes even more valuable (0.31% of parameters for 1.5B model), yet performance remains competitive or superior.
This section provides empirical evidence that the theoretical framework of LoRA actually works. By constraining weight updates to low-rank matrices, the authors achieve:
This is the "proof" that validates LoRA as a practical solution to the full fine-tuning problem.
Having shown that LoRA can be a competitive alternative to full fine-tuning on NLU, we evaluate whether LoRA still preva...
This section is the grand finale of the LoRA paper's empirical evaluation. The authors have progressively tested LoRA on increasingly complex and larger models:
Why does this matter? The paper's central claim is that LoRA can match full fine-tuning while using dramatically fewer parameters. Testing on GPT-3 175B is crucial because:
What's being tested:
The authors move from understanding tasks (NLU) to generation tasks (NLG). GPT-2 is a generative model, so this tests whether LoRA works beyond just classification and understanding problems.
The Setup:
Key Results (from Table 3):
LoRA outperforms several baselines despite having comparable or fewer trainable parameters. The metrics shown (BLEU, METEOR, ROUGE-L) are standard generation quality metrics where higher is better.
Why this matters: NLG is harder than NLU in some ways because the model must generate coherent sequences, not just classify them. If LoRA works here, it shows the approach isn't limited to discriminative tasks.
Why GPT-3 175B is special:
GPT-3 has 175 billion parameters. Let me put this in perspective:
That's roughly 500 times larger than RoBERTa. At this scale, storing optimizer states during full fine-tuning becomes prohibitively expensive.
What LoRA achieves on GPT-3:
According to the abstract:
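To understand these numbers mathematically, recall from Section 4.2 that LoRA's trainable parameters are:
$$|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r$$
Where:
- $\hat{L}_{\text{LoRA}}$ is the number of weight matrices LoRA is applied to
- $d_{\text{model}}$ is the hidden dimension (12,288 for GPT-3)
- $r$ is the rank
Compare this to full fine-tuning, where all parameters are trainable. The ratio is approximately:
$$\frac{175\text{B}}{18\text{M}} \approx 10^4$$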
Key Results (from Table 4):
Three datasets are evaluated: WikiSQL, MNLI-matched, and SAMSum.
The crucial finding: LoRA matches or exceeds full fine-tuning on all three datasets despite having orders of magnitude fewer trainable parameters.
The section makes an important empirical discovery that deserves careful attention:
"Note that not all methods benefit monotonically from having more trainable parameters."
This is surprising! In traditional machine learning, we often assume "more parameters = better performance" (up to overfitting). But here we see something different.
What's happening with prefix tuning methods:
Why? The authors hypothesize:
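But adding too many special tokens shifts the input distribution away from the pre-training distribution. Informally, if $x$ is the input and $p_{\text{pretrain}}$ is the distribution the model learned from, prepending many special tokens produces inputs drawn from a shifted distribution $p_{\text{prefix}}$.
When $p_{\text{prefix}}$ diverges too far from $p_{\text{pretrain}}$, the model's internal representations become unreliable because it wasn't trained to handle such inputs.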
LoRA doesn't have this problem because it works by learning low-rank adjustments to existing weight matrices, not by modifying the input space. The model's internal mechanisms remain grounded in its pre-training distribution.
[Figure 2 shows: GPT-3 175B validation accuracy vs. number of trainable parameters on WikiSQL and MNLI-matched]
What this plot reveals:
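The x-axis is logarithmic (spanning several orders of magnitude of trainable parameters). The y-axis is validation accuracy. You can see:
Informally, this shows that LoRA exhibits better parameter efficiency, which we can think of as accuracy achieved per trainable parameter:
$$\text{parameter efficiency} = \frac{\text{validation accuracy}}{\text{number of trainable parameters}}$$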
| Model | Task Type | Key Finding |
|---|---|---|
| GPT-2 | NLG (generation) | LoRA outperforms baselines with comparable/fewer parameters |
| GPT-3 175B | Mixed (NLU + NLG) | LoRA matches/exceeds full fine-tuning with 10,000× fewer parameters |
The broader implication: The results demonstrate that LoRA is not just a clever trick for smaller models—it's a genuinely effective approach for adapting the largest language models that exist. This has massive practical implications for deployment, cost, and accessibility of large model fine-tuning.
Transformer Language Models: Transformer (Vaswani et al., 2017) is a sequence-to-sequence architecture that makes heavy ...
This section situates LoRA within the broader landscape of machine learning research. Rather than explaining LoRA's mechanics (which previous sections covered), Section 6 answers: Where does LoRA fit in the history of ideas? What existing work inspired it? How does it differ from alternatives?
This matters because it helps you understand:
The section has four main threads that trace the intellectual lineage of the paper. Let me walk through each.
What's the main point? This subsection establishes that LoRA is designed for a specific type of model architecture and training paradigm.
The Transformer is a neural network architecture built on self-attention mechanisms. Here's what you need to know:
The key insight: Transformers became dominant because they're effective at capturing long-range dependencies in language.
The paper identifies a standard two-phase approach:
Pretraining: Train on massive amounts of general text data
Fine-tuning: Adapt the pretrained model to specific tasks
Why this matters for LoRA: As models get larger (GPT-3 has 175 billion parameters!), full fine-tuning becomes impractical. LoRA is designed to make this second phase more efficient.
What's the tension here? This subsection explains the practical problem that motivated LoRA.
Prompt Engineering (a.k.a. "prompt hacking"):
Fine-Tuning (the traditional approach):
The authors are saying: Fine-tuning works well, but it's prohibitively expensive. Can we get fine-tuning-level performance without the cost?
This is where LoRA enters the conversation.
What's the main point? This subsection reviews existing methods that try to solve the same problem LoRA addresses. This is crucial for understanding LoRA's novelty.
The concept of adapters was proposed before LoRA:
Basic idea: Insert small trainable modules between existing layers rather than updating all weights
Visual structure:
[Layer N] → [Adapter] → [Layer N+1]
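Mathematical structure (from previous section):
$$h' = h + W_{\text{up}}\, f(W_{\text{down}} h + b_{\text{down}}) + b_{\text{up}}$$
where:
- $W_{\text{down}} \in \mathbb{R}^{r \times d_{\text{model}}}$ projects down to the bottleneck
- $W_{\text{up}} \in \mathbb{R}^{d_{\text{model}} \times r}$ projects back up
- $f$ is a nonlinearity, and $b_{\text{down}}$, $b_{\text{up}}$ are biases
Key constraint: The adapters create a "bottleneck": information must pass through a narrow intermediate representation (dimension $r$), which is much smaller than $d_{\text{model}}$. This forces the adapter to learn a low-rank approximation of the weight updates.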
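As a rough sketch of the bottleneck, here is a generic adapter layer in NumPy. It is not any specific variant from the paper: the ReLU nonlinearity, the toy sizes, and the zero-initialized up-projection are all assumptions made for illustration.

```python
import numpy as np

def adapter_forward(h, W_down, b_down, W_up, b_up):
    """Generic bottleneck adapter: project down to r dims, apply a nonlinearity, project up, add residual."""
    z = np.maximum(0.0, W_down @ h + b_down)  # ReLU is an assumed stand-in nonlinearity
    return h + W_up @ z + b_up

rng = np.random.default_rng(0)
d_model, r = 16, 2  # toy sizes (assumption)

W_down = rng.normal(size=(r, d_model))
b_down = np.zeros(r)
W_up = np.zeros((d_model, r))  # zero init so the adapter starts as identity (assumption)
b_up = np.zeros(d_model)

h = rng.normal(size=(d_model,))
out = adapter_forward(h, W_down, b_down, W_up, b_up)

n_params = W_down.size + b_down.size + W_up.size + b_up.size
print(n_params, 2 * d_model * r + r + d_model)  # both follow 2*d*r + r + d
```

The parameter count of this layer matches the per-adapter term $2 \times d_{\text{model}} \times r + r + d_{\text{model}}$ from the Adapter formula. Unlike a LoRA update, the nonlinearity prevents folding the adapter into the neighbouring weight matrix, so it stays as an extra sequential step at inference.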
The problem: Adapters add inference latency — the model must compute both the original layers AND the adapter layers during deployment.
Another strategy: modify the input representations.
Basic idea: Add special trainable tokens to the input sequence
Two variants:
Prefix-Embedding Tuning: Insert trainable word embeddings among input tokens
Prefix-Layer Tuning: Learn activations after every Transformer layer
The limitation the paper identifies: These methods don't scale well. The experiments found:
The authors hypothesize: More special tokens shift the input distribution away from what the model saw during pretraining, degrading performance.
Compared to adapters:
This is actually simpler! And crucially:
The paper explicitly calls this out as a "key functional difference."
What's the main point? This subsection provides theoretical grounding for why low-rank adaptation even makes sense.
Observation 1: Low-rank structure is ubiquitous
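A matrix $W \in \mathbb{R}^{d \times k}$ has rank at most $r$ if it can be written as:
$$W = BA$$
where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
Why this matters: Instead of storing $d \times k$ numbers, you only store $r(d + k)$ numbers. If $r \ll \min(d, k)$, this is much smaller.
Example: For a $12{,}288 \times 12{,}288$ weight matrix in GPT-3, the full matrix holds about 151M numbers, while a rank-4 factorization holds $4 \times (12{,}288 + 12{,}288) = 98{,}304$.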
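As a quick check of the storage argument, the following sketch compares a full $d \times k$ matrix with its rank-$r$ factors. It assumes a square GPT-3-sized matrix ($d = k = 12{,}288$) and $r = 4$; the function names are hypothetical.

```python
def full_entries(d: int, k: int) -> int:
    """Numbers stored for a dense d x k matrix."""
    return d * k

def low_rank_entries(d: int, k: int, r: int) -> int:
    """Numbers stored for factors B (d x r) and A (r x k)."""
    return r * (d + k)

d = k = 12288  # GPT-3 hidden size
r = 4          # assumed rank

full = full_entries(d, k)
low = low_rank_entries(d, k, r)
print(full, low, full / low)
```

For a square matrix the saving is $d / (2r)$, so it grows linearly with the model width at a fixed rank.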
Observation 2: Many researchers have exploited low-rank structure
The authors acknowledge:
This is what LoRA does: it implicitly assumes that the weight changes needed to adapt a model to a new task have low-rank structure.
There's an intuitive argument here:
This is related to a principle in machine learning: The manifold hypothesis — the data and model behaviors often lie on low-dimensional manifolds within high-dimensional spaces.
Here's the narrative arc:
Transformers become dominant
↓
Need to adapt them to specific tasks
↓
Full fine-tuning is too expensive
↓
Try alternatives (prompt engineering, adapters, prefix tuning)
↓
Problem: Adapters have latency, prefix methods don't scale
↓
Insight: Weight updates for adaptation likely have low-rank structure
↓
LoRA: Merge low-rank structure with a design that has no inference latency
The genius of LoRA is that it borrows the bottleneck structure idea from adapters (which naturally enforces low-rank learning) but does it in a way that:
Section 6 establishes that LoRA isn't invented in a vacuum—it's the natural evolution of existing ideas about parameter-efficient adaptation, grounded in the empirical observation that neural network updates have low-rank properties. The novelty is in the particular combination and clever implementation rather than entirely new concepts.
Given the empirical advantage of LoRA, we hope to further explain the properties of the low-rank adaptation learned from...
The authors have just shown that LoRA works remarkably well in practice—it matches or beats full fine-tuning while training 10,000× fewer parameters. But they haven't yet explained why it works so well. This section launches an empirical investigation into the fundamental properties of low-rank adaptation.
Think of it this way: they've discovered a powerful tool, and now they want to understand the underlying principles that make it effective. This understanding will help us:
The authors identify three concrete questions they want to answer empirically:
In a Transformer model (particularly GPT-3 with 175B parameters), there are many weight matrices distributed across many layers. We can't—or don't want to—adapt all of them due to computational constraints. The question is: which subset of weight matrices should we prioritize to get the best downstream task performance within our parameter budget?
This is a resource allocation problem. If we only have a fixed "parameter budget" (say, 4.7M trainable parameters like in the DeBERTa example), where should we "spend" those parameters?
Recall from the paper's core idea: LoRA replaces weight updates with a low-rank decomposition. Instead of learning the full weight update $\Delta W$, we learn two smaller matrices whose product approximates $\Delta W$.
But here's the key question: does the actual optimal weight update really have low rank? Or are we just making it work despite this constraint being artificial?
If $\Delta W$ genuinely has low rank, this would validate the entire approach theoretically. If it does, what rank should we use in practice to balance performance and computational savings?
This is about understanding the geometry of fine-tuning. Specifically:
This helps answer: what is the model actually learning during adaptation?
The authors deliberately choose to study GPT-3 175B for this analysis, and their reasoning is important:
GPT-3 represents their largest empirical success—the most extreme compression. If they can understand why LoRA works here, they've likely understood the core principles. Additionally, the massive parameter reduction (10,000×) makes the analysis most compelling: clearly something about model updates must be fundamentally low-rank.
In the LoRA framework (from earlier sections), instead of learning $\Delta W$ directly, we learn two matrices $B$ and $A$ such that:
$$\Delta W = BA$$
where:
- $B \in \mathbb{R}^{d \times r}$
- $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d, k)$
The total number of parameters in this decomposition is:
$$r \times (d + k)$$
This is much smaller than the original $d \times k$ parameters when $r \ll \min(d, k)$.
A matrix is rank-deficient if its actual rank is smaller than its maximum possible rank. For a matrix in $\mathbb{R}^{d \times k}$, the maximum rank is $\min(d, k)$.
Key insight: If $\Delta W$ is rank-deficient, it means the "true" weight update can be represented using far fewer parameters than the full matrix. This would explain why LoRA works: we're not forcing a low-rank constraint on inherently high-rank data; rather, we're exploiting a natural property of how language models adapt.
What the authors are about to do is systematically ablate and analyze the LoRA approach:
Ablation studies: Try adapting different subsets of the attention weight matrices (e.g., only query matrices, only value matrices, or query and value together) and measure performance to see which choices matter most.
Rank analysis: For the weight matrices they adapt, examine the actual rank of $\Delta W$ by computing its singular value decomposition (SVD):
$$\Delta W = U \Sigma V^\top$$
where $\Sigma$ is a diagonal matrix of singular values in descending order. They can observe how quickly these singular values decay: rapid decay indicates true rank-deficiency.
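A small NumPy sketch of what singular-value decay looks like. It uses a synthetic update built as a product $BA$ with toy sizes (an assumption), so the matrix is rank-$r$ by construction. This only illustrates how to read the spectrum and is not a reproduction of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4  # toy sizes (assumption)

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = B @ A  # rank at most r by construction

# Singular values come back sorted in descending order.
singular_values = np.linalg.svd(delta_W, compute_uv=False)

# Count singular values that are non-negligible relative to the largest one.
numerical_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))

print(singular_values[:6])
print(numerical_rank)
```

The first $r$ singular values are of ordinary size and the rest fall to numerical noise. For a real trained update the drop is usually less sharp, so the threshold becomes a judgment call.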
Understanding these properties tells us something fundamental about how large language models adapt:
Generalization insight: If weight updates are rank-deficient, this suggests that fine-tuning on new tasks doesn't require restructuring the entire model—it only requires modest adjustments in certain directions.
Model interpretability: The correlation between $\Delta W$ and $W$ could reveal whether fine-tuning reinforces certain learned features or creates entirely new pathways.
Efficiency principles: If we understand which weight matrices are most important to adapt, we can design even more efficient adaptation schemes in the future.
Section 7 is the paper's "detective work." The authors have shown LoRA works, and now they want to understand why. They'll investigate three interconnected questions about which weights to adapt, whether those weights genuinely have low rank, and how the learned updates relate to the original model weights.
By focusing on GPT-3 175B—where the parameter reduction is most dramatic—they maximize the signal for understanding these properties. The subsequent sections will present the empirical findings from this investigation.
Given a limited parameter budget, which types of weights should we adapt with LoRA to obtain the best performance on dow...
Before diving into the mathematics, let's understand the core question: Given limited computational resources, where should we apply LoRA for maximum performance?
Recall from the abstract that LoRA works by:
But here's the practical problem: not all weight matrices are equally important to adapt. The Transformer architecture has many types of weights. In the self-attention module alone, there are weights for queries ($W_q$), keys ($W_k$), values ($W_v$), and output projections ($W_o$). If you have a fixed "budget" of trainable parameters, you need to decide: Which weights should you spend your budget on?
This section answers that question empirically through systematic experiments.
The researchers imposed a constraint: 18 million trainable parameters (about 35 MB in FP16 floating-point format) on GPT-3 175B.
Why 18M? This is a realistic constraint when:
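Recall that in LoRA, instead of directly training $\Delta W$ (the weight update), we decompose it as a product of two smaller matrices:
$$\Delta W = BA$$
where:
- $B \in \mathbb{R}^{d_{\text{model}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{model}}}$ are the trainable factors
- $r$ is the rank
Parameter count for one weight matrix: If you adapt a weight matrix of shape $d_{\text{model}} \times d_{\text{model}}$, you introduce:
$$2 \times d_{\text{model}} \times r \text{ trainable parameters}$$
For GPT-3 with 96 Transformer layers, the experiment considers two scenarios:
Adapting ONE type of attention weight (e.g., just $W_q$): $r = 8$, giving $96 \times 2 \times 12{,}288 \times 8 \approx 18.9\text{M}$ parameters.
Adapting TWO types of attention weights (e.g., both $W_q$ and $W_v$): $r = 4$, giving $2 \times 96 \times 2 \times 12{,}288 \times 4 \approx 18.9\text{M}$ parameters.
The key insight: all 96 layers are adapted in both cases, so the same parameter budget can be distributed as (fewer weight types × higher rank) OR (more weight types × lower rank).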
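The budget trade-off can be verified with a few lines of arithmetic. This sketch assumes GPT-3's shape (96 layers, $d_{\text{model}} = 12{,}288$) and LoRA applied in every layer; the function name is hypothetical.

```python
D_MODEL = 12288
N_LAYERS = 96

def lora_budget(n_weight_types: int, r: int) -> int:
    """Trainable parameters when n_weight_types attention matrices per layer get rank-r LoRA."""
    return N_LAYERS * n_weight_types * 2 * D_MODEL * r

one_type_r8 = lora_budget(1, 8)    # e.g. only W_q
two_types_r4 = lora_budget(2, 4)   # e.g. W_q and W_v
four_types_r2 = lora_budget(4, 2)  # W_q, W_k, W_v, W_o

print(one_type_r8, two_types_r4, four_types_r2)
```

All three configurations use the same number of parameters, so any performance difference between them comes from where the rank is spent rather than how much of it there is.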
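In the Transformer self-attention mechanism, each attention head computes:
$$\text{head} = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, $V$ (queries, keys, values) come from linear projections:
$$Q = XW_q, \quad K = XW_k, \quad V = XW_v$$
So "different types of attention weights" means: $W_q$, $W_k$, $W_v$, and $W_o$.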
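A single-head sketch in NumPy showing where the LoRA updates enter the attention computation. Sizes are toy values, the output projection $W_o$ is omitted, and weights are stored as (out, in) matrices applied as `X @ W.T`; all of these are simplifying assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product attention for one head; weights are (out, in)."""
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, r = 5, 16, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# LoRA factors for W_q and W_v only; W_k stays frozen and unadapted.
B_q, A_q = np.zeros((d_model, r)), rng.normal(size=(r, d_model))
B_v, A_v = np.zeros((d_model, r)), rng.normal(size=(r, d_model))

frozen_out = attention_head(X, W_q, W_k, W_v)
lora_out = attention_head(X, W_q + B_q @ A_q, W_k, W_v + B_v @ A_v)
print(np.allclose(frozen_out, lora_out))  # identical until B is trained away from zero
```

Choosing which of the projection matrices get a $BA$ term is exactly the design decision that Table 5 explores.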
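Table 5 tests different combinations: each of $W_q$, $W_k$, $W_v$, and $W_o$ alone (with $r = 8$), the pairs $\{W_q, W_k\}$ and $\{W_q, W_v\}$ (with $r = 4$), and all four matrices together (with $r = 2$).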
All use the same 18M parameter budget to ensure fair comparison.
The results show:
"Note that putting all the parameters in $\Delta W_q$ or $\Delta W_k$ results in significantly lower performance, while adapting both $W_q$ and $W_v$ yields the best result."
Let's unpack why this matters:
What this means: If you use your parameter budget to give $\Delta W_q$ a high rank (rank 8), you get worse performance than spreading the budget across $W_q$ and $W_v$ (each rank 4).
This seems counterintuitive at first! You might think: "More capacity (higher rank) should be better."
But the paper gives us the explanation:
"This suggests that even a rank of four captures enough information in $\Delta W$ such that it is preferable to adapt more weight matrices than adapting a single type of weights with a larger rank."
The mathematical intuition:
Let's denote the optimal weight update needed for a downstream task as $\Delta W^*$ (the "true" update). When you use LoRA with rank $r$, you can represent any matrix of rank at most $r$.
Scenario 1: High rank on one matrix
Scenario 2: Lower rank on multiple matrices
This observation connects back to the paper's broader theme: the weight updates are rank-deficient (they don't need full dimensionality to be effective).
If $\Delta W$ were truly rank-8, then rank 4 should fail badly. The fact that rank 4 works well suggests that:
This section establishes a practical design principle for LoRA:
Not all weight matrices are equally important: In Transformer self-attention, adapting $W_q$ and $W_v$ together gave the best results, while putting the whole budget into $W_q$ or $W_k$ alone performed noticeably worse
Rank doesn't scale linearly with performance: A rank-4 update across two matrices beats a rank-8 update on one matrix
Low-rank assumption is validated: The fact that $r = 4$ works as well as $r = 8$ (even with half the parameters per matrix) supports the paper's core assumption that adaptation matrices are inherently low-rank
Practical guidance: When implementing LoRA on your own models, focus on $W_q$ and $W_v$ in the attention modules first
| Concept | Explanation |
|---|---|
| Parameter Budget | Fixed limit (18M) forces trade-off between which matrices to adapt and what rank to use |
| Rank vs. Coverage | Spreading moderate rank across multiple matrices beats high rank on few matrices |
| Empirical Finding | $W_q$ + $W_v$ with rank 4 > $W_q$ alone with rank 8 |
| Why It Works | Different attention weight components have rank-deficient updates; better to capture all of them partially than some of them fully |
We turn our attention to the effect of rank $r$ on model performance. We adapt $\{W_q, W_v\}$, $\{W_q, W_k, W_v, W_o\}$,...
The central mystery of LoRA is: How small can we actually make the rank before the method stops working well?
Think of it this way: if we can cut the rank $r$ by a factor of 8 (for example, from 64 to 8), that's an 8x reduction in trainable parameters, because the parameter count grows linearly with $r$. This section answers this crucial practical question by investigating whether the weight updates that the model learns during adaptation are inherently low-rank (i.e., whether they can be well-approximated using a small rank).
The authors will show something surprising: very small ranks work surprisingly well, and this isn't just lucky—the adaptation matrices actually have an intrinsically low-rank structure.
The authors test three different configurations:
For each configuration, they measure how accuracy changes as $r$ varies. Table 6 shows the results.
The results are striking:
This empirically suggests that the matrices $\Delta W_q$ and $\Delta W_v$ contain one or a few dominant directions that matter for task adaptation. Adding more rank dimensions doesn't substantially improve performance, suggesting those extra dimensions don't encode useful information.
Now the authors dig deeper: Are the subspaces learned by different ranks actually the same, or are they discovering different information? This is where the mathematics becomes rigorous.
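To compare subspaces mathematically, the authors use a subspace similarity measure based on the Grassmann distance:
$$\phi(A_{r=8}, A_{r=64}, i, j) = \frac{\left\| U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j} \right\|_F^2}{\min(i, j)} \in [0, 1]$$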
Let me break down every element of this formula:
What each variable means:
Understanding the notation:
What this formula computes geometrically:
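Think of SVD as a change of coordinates that rotates your data. The columns of $U_A$ are the singular directions of $A$, ordered from most to least important. When you compute $U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j}$, you're finding the projection of the top $i$ directions from $A_{r=8}$ onto the top $j$ directions of $A_{r=64}$.
The squared Frobenius norm of this projection tells you how much those subspaces overlap. Dividing by $\min(i, j)$ normalizes this to a range of $[0, 1]$ where:
- 1 means the two subspaces overlap completely
- 0 means they are completely separate (orthogonal)
What Figure 3 shows: The heatmaps display $\phi$ values between pairs of ranks (typically $r = 8$ vs $r = 64$). The axes represent the dimension indices $i$ and $j$.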
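The similarity measure can be sketched directly with NumPy's SVD. The matrices here are random stand-ins with toy shapes (an assumption), not trained LoRA weights, so the example only shows the mechanics: identical subspaces score 1, and any pair scores between 0 and 1.

```python
import numpy as np

def subspace_similarity(A1, A2, i, j):
    """phi = ||U1_i^T U2_j||_F^2 / min(i, j), using the top right-singular vectors of each matrix."""
    # np.linalg.svd returns V^T; its rows are the right-singular vectors.
    _, _, Vt1 = np.linalg.svd(A1, full_matrices=False)
    _, _, Vt2 = np.linalg.svd(A2, full_matrices=False)
    U1 = Vt1[:i].T  # shape (d, i)
    U2 = Vt2[:j].T  # shape (d, j)
    return np.linalg.norm(U1.T @ U2, "fro") ** 2 / min(i, j)

rng = np.random.default_rng(0)
d = 128                            # toy input dimension (assumption)
A_r8 = rng.normal(size=(8, d))     # stand-in for a rank-8 LoRA A matrix
A_r64 = rng.normal(size=(64, d))   # stand-in for a rank-64 LoRA A matrix

phi_self = subspace_similarity(A_r8, A_r8, 4, 4)
phi_cross = subspace_similarity(A_r8, A_r64, 4, 4)
print(phi_self, phi_cross)
```

Sweeping `i` and `j` over their ranges and plotting the resulting grid gives a heatmap of the kind the figure describes. For unrelated random matrices the cross value stays low, which is the role of the Gaussian baseline.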
Key pattern observed:
Near the origin (top-left corner): High similarity values (bright color)
Away from the origin: Low similarity values (dark color)
The authors' crucial finding:
"Directions corresponding to the top singular vector overlap significantly between $A_{r=8}$ and $A_{r=64}$, while others do not."
This is mathematically precise: the top-1 singular direction is shared by both $A_{r=8}$ and $A_{r=64}$ with normalized similarity > 0.5. This explains why $r = 1$ works so well empirically: you're capturing the single most important direction of adaptation.
To ensure this isn't an artifact of one particular training run, the authors train the same model twice with different random initializations (random seeds) but the same rank ($r = 64$).
Left and Middle panels: Subspace similarity between two independent training runs
$\Delta W_q$: Shows a checkerboard-like pattern with some alignment near the diagonal
$\Delta W_v$: Shows stronger diagonal dominance (more concentrated overlap)
Right panel: Comparison with random Gaussian matrices (noise baseline)
This is a convergence argument: if two different random initializations both learn the same low-rank subspace, that subspace represents a fundamental property of the task, not noise. It suggests the adaptation has discovered what information is actually necessary for the downstream task.
The authors propose that adaptation matrices have an intrinsic rank: a small number of fundamental directions that capture task-relevant information.
What does "intrinsic rank" mean?
Why does this matter for LoRA?
Given a parameter budget, you want to choose $r$ large enough to capture these intrinsic directions but no larger. The analysis shows:
| Level | Evidence | What It Shows |
|---|---|---|
| Empirical (Table 6) | Small $r$ achieves good accuracy | The model doesn't need high rank to adapt |
| Subspace Analysis (Figure 3) | Top directions overlap across ranks | Consistent, fundamental structure exists |
| Replicability (Figure 4) | Different seeds learn same structure | This structure is intrinsic, not accidental |
Together, these analyses provide strong evidence that language model adaptation to downstream tasks is fundamentally low-rank, justifying LoRA's design choice to use small ranks with practical parameter efficiency.
Now let me build the complete explanation:
The equation you're analyzing measures subspace similarity between two different adaptation matrices in LoRA:
$A_{r=8}$ and $A_{r=64}$: These are the learned adaptation matrices from LoRA training with different ranks, one using rank 8 and one using rank 64. These matrices capture how the pre-trained model's weights should be modified for a specific task.
$U_{A_{r=8}}$ and $U_{A_{r=64}}$: The right-singular unitary matrices obtained from the Singular Value Decomposition (SVD) of the respective adaptation matrices. The SVD factors a matrix into left-singular vectors, singular values, and right-singular vectors, where:
The $U$ matrices capture the directions along which the weight updates are most significant.
$U_{A_{r=8}}^{i}$ and $U_{A_{r=64}}^{j}$: These represent selecting only the first (top) $i$ and $j$ singular vectors from each matrix respectively. These are the "most important" directions, ranked by their corresponding singular values.
$U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j}$: This is the matrix product of the transpose of the first matrix with the second. This product captures how much the $i$ most important directions from $A_{r=8}$ align with the $j$ most important directions from $A_{r=64}$.
$\|\cdot\|_F^2$: The squared Frobenius norm. For a matrix $M$, the Frobenius norm is $\|M\|_F = \sqrt{\sum_{p,q} M_{pq}^2}$, which is essentially the Euclidean norm of all matrix elements flattened into a vector. The squared version avoids the square root.
$\min(i, j)$: The normalization factor that ensures the metric is bounded in $[0, 1]$. When comparing $i$ vectors with $j$ vectors, we normalize by the smaller dimension to account for the maximum possible overlap.
Let me illustrate with a concrete SVD example. When we decompose a matrix:

For the matrix shown, the $U$ matrix has 2 columns (left singular vectors), representing two fundamental directions in the output space.
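A minimal NumPy sketch of such a decomposition, using a hypothetical 2×3 matrix `M` chosen only for illustration (it is an assumption, not a matrix from the paper):

```python
import numpy as np

# Hypothetical 2x3 matrix, chosen only for illustration.
M = np.array([[3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])

# full_matrices=False returns the "thin" SVD:
# U is 2x2, S holds 2 singular values, Vt is 2x3.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

print("U shape:", U.shape)      # two columns = two left singular vectors
print("singular values:", S)    # sorted from largest to smallest
print("Vt shape:", Vt.shape)    # rows are right singular vectors

# The columns of U are orthonormal, and the three factors rebuild M.
reconstruction = U @ np.diag(S) @ Vt
print("reconstruction matches M:", np.allclose(reconstruction, M))
```

Because `M` has two rows, `U` has two columns, one per direction in the two-dimensional output space.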
The denominator $\min(i, j)$ is crucial. Consider comparing the top $i$ and $j$ singular vectors:
$\phi = 1$: Perfect subspace overlap. The top directions from rank 8 exactly match the top directions from rank 64. This suggests both matrices learn the same most important directions.
$\phi \approx 0.5$: Moderate overlap. This is significant: the paper notes that sharing a 1-dimensional subspace with similarity > 0.5 explains why rank $r = 1$ performs well.
$\phi = 0$: Complete separation. The subspaces are orthogonal, so the top directions in one rank assignment are entirely different from the other.
The critical finding from Figure 3 in the paper shows:
High overlap at low dimensions: The top singular vectors (corresponding to small $i$ and $j$) have high similarity, often $> 0.5$.
Divergence at higher dimensions: As $i$ and $j$ increase (moving into less important directions), the similarity drops significantly.
This explains low-rank sufficiency: Since most of the meaningful subspace overlap occurs in the first few directions, a small rank captures the essential adaptation needed, making much higher ranks largely redundant.
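The key mathematical property is that matrices with orthonormal columns, such as $U_{A_{r=8}}^{i}$ and $U_{A_{r=64}}^{j}$, satisfy $U^\top U = I$.
When you form $U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j}$ where both come from SVD, its singular values $\sigma_1, \dots, \sigma_p$ (with $p = \min(i, j)$) all lie in $[0, 1]$, so $\|U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j}\|_F^2 = \sum_k \sigma_k^2 \le \min(i, j)$. This is why the normalized measure stays in $[0, 1]$.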
This metric is closely related to the Grassmann distance, which measures distance between subspaces in a principled geometric way. The paper uses this normalized version for interpretability.
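The measure described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `subspace_similarity` and all test matrices are assumptions, and the right-singular vectors are used to match the description of $U$ given above.

```python
import numpy as np

def subspace_similarity(A, B, i, j):
    """Normalized overlap between the top-i and top-j right-singular subspaces of A and B."""
    # np.linalg.svd returns V^T; its first rows are the top right-singular vectors.
    U_A = np.linalg.svd(A, full_matrices=False)[2][:i].T   # shape (k, i)
    U_B = np.linalg.svd(B, full_matrices=False)[2][:j].T   # shape (k, j)
    return np.linalg.norm(U_A.T @ U_B, ord="fro") ** 2 / min(i, j)

rng = np.random.default_rng(0)

# Same matrix on both sides: the top-2 subspace lies inside the top-5 subspace.
A = rng.normal(size=(8, 64))
same = subspace_similarity(A, A, 2, 5)

# Row spaces on disjoint coordinates: the subspaces are orthogonal.
P = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
Q = np.array([[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
disjoint = subspace_similarity(P, Q, 2, 2)

# Two unrelated Gaussian matrices: a noise baseline.
G1 = rng.normal(size=(16, 512))
G2 = rng.normal(size=(16, 512))
noise = subspace_similarity(G1, G2, 4, 4)

print(same, disjoint, noise)
```

The three cases correspond to the interpretations of the metric: full containment gives a value of 1, orthogonal subspaces give 0, and unrelated random matrices in a high-dimensional space land close to 0.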
Visualizing the normalization factor in the subspace similarity metric


Computing SVD to understand what U matrices look like in a concrete example


We further investigate the relationship between $\Delta W$ and $W$. In particular, does $\Delta W$ highly correlate with...
This section asks a fundamental question: What is the relationship between the weight updates we learn (ΔW) and the original pre-trained weights (W)?
Specifically, the authors want to understand:
This matters because it reveals the mechanism of adaptation: Are we simply tweaking what's already there, or are we discovering entirely new features?
The authors first ask: Is ΔW mostly contained in the top singular directions of W?
Let me unpack what this means:
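When we perform singular value decomposition (SVD) on a matrix $W$, we're decomposing it as:
$W = U \Sigma V^\top$
where the columns of $U$ and $V$ are the left- and right-singular vectors and $\Sigma$ is a diagonal matrix of non-negative singular values $\sigma_1 \ge \sigma_2 \ge \dots$
The top singular directions refer to the directions corresponding to the largest singular values $\sigma_1, \sigma_2, \dots$. These represent the directions in which the matrix has the most "importance" or "energy."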
Intuition: If a matrix emphasizes certain directions heavily (large singular values), those are the "important" directions it uses.
If ΔW were just following what W already emphasizes, we'd expect ΔW to align with W's top singular directions. But if ΔW is discovering different features, it would align with W's less-emphasized directions.
The authors measure this by projecting W onto the subspace spanned by ΔW's singular vectors.
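Since ΔW has rank at most $r$, we can decompose it as:
$\Delta W = BA$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the low-rank factors. The SVD of ΔW gives us:
$\Delta W = U \Sigma V$
where $U$ holds the top $r$ left-singular vectors as columns, $\Sigma$ is the diagonal matrix of singular values, and, following the paper's $U^\top W V^\top$ notation, $V$ holds the top $r$ right-singular vectors as rows.
Now we compute:
$U^\top W V^\top$
What does this mean geometrically?
The result is an $r \times r$ matrix showing what portion of $W$ "lives in" the same directions as ΔW.
The authors compute the Frobenius norm:
$\|U^\top W V^\top\|_F$
and compare it to:
$\|W\|_F$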
The Frobenius norm of a matrix $M$ is defined as:
$\|M\|_F = \sqrt{\sum_{p,q} M_{pq}^2}$
It measures the "total size" of a matrix—roughly, the square root of the sum of all squared entries.
What the ratio $\|U^\top W V^\top\|_F / \|W\|_F$ tells us:
This ratio ranges from 0 to 1 and tells us: What fraction of W's total energy/magnitude lies in the directions that ΔW uses?
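The authors compute this projection in three different ways, taking $U$ and $V$ from the top singular directions of ΔW, of W itself, or of a random matrix: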
Looking at the table, three conclusions emerge:
The values in the ΔW row are much larger than the random row. This means ΔW isn't learning arbitrary features—it's related to what W already knows.
Interpretation: The model reuses features it learned during pre-training rather than inventing completely new ones.
The ΔW row has smaller values than the W row! This means:
Interpretation: ΔW doesn't repeat W's most important directions. Instead, it amplifies secondary features that W learned but downplayed.
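The text states: the amplification factor is $21.5 \approx 6.91 / 0.32$ for $r = 4$
This is computed as:
$\|\Delta W\|_F / \|U^\top W V^\top\|_F \approx 6.91 / 0.32 \approx 21.5$
What this means:
In the subspace spanned by ΔW's directions, W has a magnitude of only about 0.32, while ΔW itself has a Frobenius norm of 6.91. In other words, ΔW takes directions that W barely emphasizes and amplifies them by a factor of about 21.5×.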
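The projection and the amplification factor can be sketched as follows. Everything here is a hypothetical stand-in: `W` and `delta_W` are random matrices rather than real pre-trained weights or LoRA updates, so the printed numbers will not resemble the paper's table. The sketch only shows how the quantities are formed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4

# Hypothetical stand-ins: a random "pre-trained" weight and a random rank-r update.
W = rng.normal(size=(d, k))
delta_W = (rng.normal(size=(d, r)) @ rng.normal(size=(r, k))) * 0.1

def top_singular_vectors(M, r):
    """Top-r left singular vectors (columns) and right singular vectors (rows)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], Vt[:r, :]

def projected_norm(W, U, Vt):
    """Frobenius norm of the r x r matrix U^T W Vt^T."""
    return np.linalg.norm(U.T @ W @ Vt.T, ord="fro")

U_d, Vt_d = top_singular_vectors(delta_W, r)   # directions used by delta_W
U_w, Vt_w = top_singular_vectors(W, r)         # W's own top directions

proj_delta = projected_norm(W, U_d, Vt_d)
proj_self = projected_norm(W, U_w, Vt_w)
total = np.linalg.norm(W, ord="fro")

print("fraction of W in delta_W's directions:", proj_delta / total)
print("fraction of W in its own top directions:", proj_self / total)
print("amplification factor:", np.linalg.norm(delta_W, ord="fro") / proj_delta)

# The arithmetic quoted in the text, using the two norms reported there.
quoted_factor = 6.91 / 0.32
print("quoted factor:", quoted_factor)
```

W's own top directions always capture at least as much of W as any other choice of $r$ directions, which is why the W row serves as an upper reference point for the comparison.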
Combining these findings reveals how LoRA works:
Mathematical interpretation:
This is not random noise; it's a selective amplification:
This section explains why LoRA is so effective despite its small rank:
This is a profound insight: Large language models are already very powerful after pre-training. Fine-tuning is primarily about prioritization, not creation.
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switch...
This is the paper's conclusion, which serves two main purposes:
Summarizes the key achievement: LoRA solves a critical practical problem — fine-tuning enormous language models (like GPT-3 with 175 billion parameters) is computationally prohibitive, so the authors propose an efficient alternative.
Outlines open questions and future directions: Rather than claiming they've solved everything, the authors honestly discuss limitations and promising avenues for future research.
This is important because it positions LoRA not as a complete solution, but as a stepping stone that enables both practical improvements AND deeper scientific understanding of how language models adapt.
"Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost for hosting independent instances for different tasks."
Traditional fine-tuning requires:
The cost is enormous:
LoRA enables:
The authors emphasize three critical properties:
Efficiency:
Quality preservation:
Practical deployment:
"While we focused on Transformer language models, the proposed principles are generally applicable to any neural networks with dense layers."
LoRA isn't specific to language models. The core idea is:
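For any weight matrix $W_0$ in a neural network, replace it with:
$W = W_0 + \Delta W = W_0 + BA$
where $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank factors, and $r \ll \min(d, k)$.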
This applies to:
The principle is model-agnostic: whenever you have a large weight matrix, you can apply low-rank decomposition.
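As a rough sketch of this idea for a single dense layer, the following NumPy snippet keeps a frozen weight `W0` and adds a low-rank path through `A` and `B`. The dimensions, the function name `lora_forward`, and the omission of any scaling factor are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2

W0 = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init so the update starts at 0

def lora_forward(x, W0, A, B):
    """h = W0 x + B A x, without ever forming the d_out x d_in update matrix."""
    return x @ W0.T + (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))           # batch of 3 inputs

# At initialization the layer behaves exactly like the frozen base layer.
h_init = lora_forward(x, W0, A, B)

# After (hypothetical) training, B is non-zero and the output shifts by B A x.
B_trained = rng.normal(size=(d_out, r))
h_trained = lora_forward(x, W0, A, B_trained)
h_reference = x @ (W0 + B_trained @ A).T
```

Only `A` and `B` would receive gradients during training; `W0` stays untouched, which is what makes the same recipe applicable to any dense layer.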
The authors identify four important open questions:
"LoRA can be combined with other efficient adaptation methods, potentially providing orthogonal improvement."
What does "orthogonal" mean here?
In linear algebra, two vectors are orthogonal if they're perpendicular — they don't overlap. Here, "orthogonal improvement" means: improvements from different methods don't cancel out, they add together.
Examples of complementary methods:
You could combine LoRA with these — e.g., LoRA + quantization might be even more efficient than either alone.
"The mechanism behind fine-tuning or LoRA is far from clear – how are features learned during pre-training transformed to do well on downstream tasks?"
Why is this important?
Currently, deep learning is somewhat of a "black box." We know fine-tuning works empirically, but we don't fully understand:
Why does LoRA help answer this?
"We mostly depend on heuristics to select the weight matrices to apply LoRA to. Are there more principled ways to do it?"
What's the problem?
In Section 7.1, the authors tested which attention weights to adapt:
They used empirical testing (running experiments and measuring accuracy), which is expensive.
What would be "principled"?
A theoretical framework that could predict which weights are important without running expensive experiments. For example:
"Finally, the rank-deficiency of $\Delta W$ suggests that $W$ could be rank-deficient as well, which can also be a source of inspiration for future works."
What does "rank-deficient" mean?
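For a matrix of shape $d \times k$, the rank is at most $\min(d, k)$; the matrix is rank-deficient when its rank is strictly smaller than that.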
The observation:
Why is this significant?
If pre-trained models have inherent rank-deficiency, it means:
This opens research into:
This conclusion demonstrates scientific maturity:
| Aspect | Significance |
|---|---|
| Practical Impact | LoRA solves a real problem: making large model fine-tuning affordable |
| Honesty about limitations | Authors acknowledge they don't have all answers |
| Enables future work | LoRA's simplicity and efficiency make it a good foundation for further research |
| Bridges practice and theory | Points toward deeper understanding of how neural networks adapt |
The paper doesn't claim LoRA is perfect — rather, it's a practical tool that simultaneously opens new research questions about model adaptation, compression, and the fundamental properties of large language models.
Few-shot learning, or prompt engineering, is very advantageous when we only have a handful of training samples. However,...
This section addresses a fundamental question: Why is fine-tuning necessary at all? The authors are making a crucial argument for why LoRA (and fine-tuning in general) matters.
In the broader context of the paper, LoRA is proposed as an efficient way to adapt pre-trained models. But before explaining how to adapt efficiently, the authors need to establish that adaptation is actually necessary—that we can't just use clever prompting tricks instead of updating model parameters.
This is important because if few-shot prompting worked well enough, there would be no need for LoRA, and the entire paper's contribution would be unnecessary. So this section provides empirical justification for the whole approach.
Few-shot learning means showing the model only a handful of examples (typically 1-10 examples) in the prompt itself, without updating any model parameters.
For example, with GPT-3, you might write:
Example 1: Input: "This movie is great!" → Output: "Positive"
Example 2: Input: "I hated it." → Output: "Negative"
Now classify: "This film was amazing!" → Output: ?
Advantages:
Fine-tuning means taking a pre-trained model and updating all (or some) of its parameters using thousands of labeled examples from your specific task.
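Mathematically, in standard fine-tuning, we update the weight matrices by minimizing a loss function through gradient descent:
$W \leftarrow W - \eta \nabla_W \mathcal{L}(W)$
where $\eta$ is the learning rate and $\mathcal{L}$ is the task loss computed on the labeled examples.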
This process actually modifies the model's internal parameters to specialize in your specific task.
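To make the update rule concrete, here is a toy sketch of gradient descent on a small linear regression problem. The data, the zero-initialized stand-in for pre-trained weights, and the learning rate are all hypothetical; real fine-tuning uses a language-model loss and an optimizer such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))        # hypothetical task inputs
W_true = rng.normal(size=(4, 2))
Y = X @ W_true                      # hypothetical task targets

W_init = np.zeros((4, 2))           # stand-in for the pre-trained weights
W = W_init.copy()
lr = 0.1                            # learning rate (eta)

def loss(W):
    """Mean squared error over all entries of the prediction."""
    return np.mean((X @ W - Y) ** 2)

losses = [loss(W)]
for _ in range(50):
    # Gradient of the mean squared error with respect to W.
    grad = 2 * X.T @ (X @ W - Y) / (X.shape[0] * Y.shape[1])
    W = W - lr * grad               # W <- W - eta * grad
    losses.append(loss(W))

# Cumulative change to the weights after training.
delta_W = W - W_init
print(losses[0], losses[-1])
```

The weights themselves move during training, and `delta_W` records the total change accumulated over all steps.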
The section references Table 8, which compares these two approaches on GPT-3:
| Method | Performance |
|---|---|
| Few-shot learning | Lower accuracy |
| Fine-tuning | Higher accuracy |
The key claim is: fine-tuning drastically outperforms few-shot learning on datasets both large and small.
On small datasets: You might think few-shot learning would suffice with limited examples. But Table 8 shows fine-tuning is still dramatically better.
On large datasets: When you have thousands of training examples, the gap widens even more.
This empirical observation justifies the entire paper's premise: we need to update model parameters to achieve good performance on downstream tasks.
When we fine-tune, we're computing:
$\Delta W = W_{\text{fine-tuned}} - W_0$
This represents the cumulative change to the weight matrices after training on task-specific data.
From earlier sections of the paper (section 7.3), we learned that:
Few-shot learning only provides the model with examples through the input context window, but doesn't allow it to reorganize its internal representations. The model's weights stay frozen, so it can only apply the general knowledge from pre-training.
In contrast, fine-tuning lets the model reorganize these internal representations—amplifying features relevant to your specific task. Think of it like:
This section sets up the motivation for LoRA:
LoRA provides the benefits of fine-tuning (parameter updates to $W$) while dramatically reducing computational cost by using low-rank decomposition. Rather than updating full weight matrices $W$, LoRA only updates small rank-decomposition matrices $B$ and $A$ that produce $\Delta W = BA$.
From the abstract: LoRA reduces trainable parameters by 10,000x and memory by 3x compared to full fine-tuning.
| Aspect | Few-Shot | Fine-Tuning | LoRA |
|---|---|---|---|
| Requires training? | No | Yes | Yes |
| Parameter updates? | No | All parameters | Low-rank approximation |
| Performance | Lower | Higher | On par with or better than fine-tuning |
| Computational cost | Low | Very high | Much lower |
| Inference latency | None | None | None |
This section makes a simple but powerful point: parameter updates matter, a lot. The data shows there's no shortcut—if you want good performance on a specific task, you need to actually train on that task's data. LoRA's contribution is making that training feasible at scale.
Adapter layers are external modules added to a pre-trained model in a sequential manner, whereas our proposal, LoRA, can...
This appendix section makes an important practical comparison: while LoRA (the main contribution of the paper) and adapter layers are both parameter-efficient fine-tuning methods, they have fundamentally different architectural designs that lead to very different runtime performance characteristics.
The key insight is this: When you deploy a model to serve real users (online inference), speed matters as much as accuracy. Adapter layers introduce measurable slowdown in this setting, while LoRA does not. This is a crucial practical advantage that gets glossed over in many papers but is vital for real-world applications.
Adapter layers are small neural network modules inserted into a pre-trained model sequentially — meaning they're computed in addition to the base model computations in a chain-like fashion:
Input → Base Model Layer → Adapter Module → Next Layer → ...
Think of it like adding an extra processing step that must complete before moving forward.
LoRA, by contrast, adds trainable parameters in parallel to the base model:
Input → Base Model Layer
↓
(ΔW added here, computed together with original weights)
↓
Output
The update matrix $\Delta W = BA$ (from earlier sections) is computed simultaneously with the base computation, not sequentially after it.
When computation happens sequentially, you must complete one step before starting the next. This creates latency — the time between input and output.
When computation happens in parallel, it can be fused into a single operation, avoiding extra roundtrips through the hardware.
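One way to see why the parallel path can be fused: the low-rank update can be folded into the base weight once, after which inference is a single matrix multiplication. The sketch below checks this numerically with hypothetical small matrices; the shapes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 6, 8, 2

W0 = rng.normal(size=(d_out, d_in))   # frozen base weight
B = rng.normal(size=(d_out, r))       # hypothetical trained LoRA factors
A = rng.normal(size=(r, d_in))
x = rng.normal(size=(1, d_in))        # batch size 1, as in online inference

# During training: base path plus the low-rank path.
h_two_paths = x @ W0.T + (x @ A.T) @ B.T

# For deployment: fold the update into the weight once, then use one matmul.
W_merged = W0 + B @ A
h_merged = x @ W_merged.T

print(np.allclose(h_two_paths, h_merged))
```

After merging, the deployed layer has the same shape and cost as the original layer, whereas a sequential adapter module remains an extra step on every forward pass.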
The authors quantify this difference by measuring:
The figure shows percentage slowdown across different configurations. The critical observation is:
Slowdown depends heavily on batch size and sequence length:
Small batch size + short sequences (typical of online inference):
Large batch size + long sequences (typical of batch processing):
GPUs excel at parallel processing — executing the same operation on many data points simultaneously. Think of it like this:
The paper emphasizes "online inference where the batch size is small" — this is the real-world deployment scenario:
In this scenario:
Recall from the abstract that LoRA has this key claim:
"unlike adapters, no additional inference latency"
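This appendix section supports that claim experimentally. The architectural difference (parallel vs. sequential) translates directly into LoRA's inference latency matching that of the base model, versus an adapter model's latency being the base latency plus an extra adapter term, where the adapter term can amount to a 20-30% slowdown for small batches.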
| Aspect | Adapters | LoRA |
|---|---|---|
| Architecture | Sequential (added on top) | Parallel (fused in) |
| Inference Latency | +20-30% (for small batches) | +0% |
| Training Efficiency | Good | Better |
| Practical Deployment | Problematic for online inference | Ideal |
Imagine two designs for adding a security checkpoint to a building:
When you have many people (large batch), both designs might seem similar because you're processing them simultaneously. But when people arrive one at a time (small batch), the separate checkpoint is a bottleneck.
GLUE Benchmark is a wide-ranging collection of natural language understanding tasks. It includes MNLI (inference), SST-2...
Before diving into the details, let me set the context. The LoRA paper makes claims about its effectiveness across multiple tasks and datasets. Section C is essentially the empirical foundation for those claims—it lists and describes all the datasets used in the paper's experiments.
Think of this section as the "recipe card" for the paper's experiments. Just as a chef needs to specify exact ingredients and their quantities for a recipe to be reproducible, researchers need to document which datasets they used, how large they are, and what they measure. This allows other researchers to:
The section introduces six major dataset collections used in the LoRA experiments. Let me break down what each one measures and why it matters:
The GLUE (General Language Understanding Evaluation) benchmark is a meta-dataset: a collection of natural language understanding tasks, of which the following 8 are used here:
| Task | Type | What It Measures |
|---|---|---|
| MNLI | Inference | Can the model determine if sentence A logically follows from sentence B? |
| SST-2 | Sentiment Analysis | Does the model correctly identify if text expresses positive or negative sentiment? |
| MRPC | Paraphrase Detection | Can the model recognize when two sentences mean the same thing? |
| CoLA | Linguistic Acceptability | Does the model understand grammatical correctness? |
| QNLI | Inference | Can the model answer whether a question can be answered by a given sentence? |
| QQP | Question-Answering Similarity | Can the model identify if two questions are semantically equivalent? |
| RTE | Inference | Can the model recognize textual entailment (logical relationships)? |
| STS-B | Textual Similarity | Can the model measure the degree of semantic similarity between sentence pairs? |
Why use GLUE? It's a standard benchmark in NLP research, so using it allows LoRA's results to be compared against many other methods' published results.
WikiSQL-style target query: SELECT COUNT(*) FROM restaurants WHERE country = 'France'
E2E NLG-style example: Input: (Restaurant_Name: The Eagle, Food_Type: Italian, Customer_Rating: 8/10) → Output: "The Eagle is a great Italian restaurant with an 8 out of 10 customer rating."
Here's the strategic organization of these datasets:
Classification & Understanding Tasks (GLUE): Tests whether LoRA maintains performance on traditional NLU tasks
Structured Output Tasks (WikiSQL): Tests whether LoRA works when outputs must follow specific formats
Abstractive Generation Tasks (SAMSum, E2E NLG, DART, WebNLG): Tests whether LoRA works on open-ended generation, which requires more nuanced language understanding
This diversity is important for the paper's main claim: that LoRA is a general-purpose adaptation method that works across different task types.
When the section specifies sizes like "56,355 training / 8,421 validation," these numbers matter because:
Recall from Appendix A that the paper showed fine-tuning substantially outperforms few-shot learning (see Table 8). The diversity of datasets in Section C demonstrates that this advantage of fine-tuning—which LoRA aims to preserve—applies across many different task types, not just one or two special cases.
Summary: Section C is a detailed inventory of the experimental foundation. It shows the paper tested LoRA on 6 different dataset collections covering classification, structured output, and generative tasks. This breadth allows readers to assess whether LoRA's claimed improvements are genuine advances across the board, or limited to specific task types.
D.1 RoBERTa: We train using AdamW with a linear learning rate decay schedule. We sweep learning rate, number of training...
Before diving into the technical details of LoRA, it's crucial to understand how the researchers actually trained and evaluated their models. This section is essentially documenting the "experimental recipe" — the specific choices and parameter settings used for each model architecture tested in the paper.
Why is this important? Because machine learning results are sensitive to hyperparameter choices. The same algorithm can produce vastly different results depending on learning rates, batch sizes, number of training epochs, etc. By documenting these choices explicitly, the authors enable:
Think of hyperparameters as the "knobs and dials" on an experimental apparatus. This section tells us exactly what position each knob was turned to for each model.
Hyperparameters are the external configuration choices that we set before training begins (as opposed to the model parameters, which are learned during training). The key hyperparameters discussed here are:
Context: RoBERTa is a BERT-variant model (see the abstract context — these are Transformer-based models that the authors evaluated).
Key details:
Optimizer and Schedule:
Mathematically, if $\eta_0$ is the initial learning rate and training lasts for $T$ total steps, at step $t$ the learning rate is:
$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)$
Hyperparameter Sweep: The authors explicitly "swept" (i.e., tried multiple values of) three hyperparameters:
This means they didn't just pick one value; they tried many combinations and chose the best performing ones.
Transfer Learning Initialization Trick:
Statistical Reporting:
Context: DeBERTa is another Transformer variant that the authors evaluate.
Key differences from RoBERTa:
Similar optimizer structure: AdamW with linear learning rate decay (same as RoBERTa)
Different hyperparameter sweep: Instead of sweeping over learning rate, epochs, and batch size, they tune:
Why these differences? The authors explicitly state they followed He et al. (2021)'s methodology to ensure fair comparison with DeBERTa's original results.
Fair comparison principle: They use the exact same sequence length (maximum input length) that was used in the original DeBERTa paper, ensuring they're not inadvertently giving LoRA an advantage or disadvantage.
Context: GPT-2 is a generative language model (unlike RoBERTa and DeBERTa, which are encoder-only models).
Notable constraints:
Fixed training length: All GPT-2 models trained for exactly 5 epochs
Hyperparameter source: The specific values for batch size, learning rate, and beam search beam size all come from Li & Liang (2021)
Consistency across models: By using established values from prior work, the authors ensure they're making a fair comparison
Context: GPT-3 is the largest model tested (175 billion parameters, as mentioned in the abstract).
Simpler configuration:
Fixed training protocol:
Why so simple? For such a large model on large-scale data:
To make this more concrete, let's think about what a linear decay schedule means. If we're training for $T$ total steps and start with learning rate $\eta_0$, the effective learning rate at step $t$ is:
$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)$
At $t = 0$ (start): $\eta_t = \eta_0$ (full learning rate)
At $t = T$ (end): $\eta_t = 0$ (no learning rate, so training effectively stops)
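The schedule just described fits in a few lines of Python. This is a simplified sketch: it ignores any warm-up phase, and the function name `linear_decay_lr` and the example values are assumptions rather than settings taken from the paper.

```python
def linear_decay_lr(step, total_steps, initial_lr):
    """Learning rate that falls linearly from initial_lr at step 0 to 0 at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    return initial_lr * (1 - step / total_steps)

total_steps = 1000      # hypothetical training length
initial_lr = 2e-4       # hypothetical starting learning rate

for step in (0, 250, 500, 750, 1000):
    print(step, linear_decay_lr(step, total_steps, initial_lr))
```

Halfway through training the learning rate is half its starting value, and it reaches zero exactly at the final step.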
This schedule helps because:
This section exemplifies scientific rigor in machine learning. By documenting:
The authors ensure that other researchers can reproduce their results and verify their claims about LoRA's effectiveness. This is particularly important because LoRA's main contribution is not inventing a new algorithm, but rather showing that a relatively simple modification to fine-tuning can match full fine-tuning performance while being vastly more parameter-efficient.
LoRA can be naturally combined with existing prefix-based approaches. LoRA+PrefixEmbed (LoRA+PE) combines LoRA with pref...
This section explores whether LoRA can be combined with other parameter-efficient fine-tuning methods, specifically prefix tuning approaches. The key question is: are these methods complementary (can they work together synergistically) or redundant (do they solve the same problem)?
Think of it this way: LoRA adds trainable low-rank matrices to the weight layers, while prefix tuning adds trainable tokens or hidden vectors. The section tests whether combining both approaches gives better results than either alone.
From earlier sections, LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices. For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in a Transformer layer, instead of updating $W_0$ directly, LoRA computes updates as a product of two low-rank matrices: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$ (rank much smaller than dimension).
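A quick bit of arithmetic shows how much smaller the low-rank factors are than a full update. The sketch below counts trainable parameters for one square weight matrix; the dimension 12288 and rank 4 are illustrative assumptions, and the helper names are hypothetical.

```python
def full_update_params(d, k):
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k

def lora_update_params(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 12288   # illustrative hidden size
r = 4           # illustrative LoRA rank

full = full_update_params(d, k)
lora = lora_update_params(d, k, r)

print("full update:", full)
print("LoRA update:", lora)
print("reduction factor:", full / lora)
```

For a square matrix the reduction factor simplifies to $d / (2r)$, so it grows with the layer width and shrinks as the rank increases.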
Prefix tuning is an alternative parameter-efficient method that adds learnable parameters to the input of a model. There are two variants mentioned here:
1. Prefix-Embedding Tuning (PE):
2. Prefix-Layer Tuning (PL):
What happens:
Intuition: LoRA modifies how the model processes information (through weight updates), while prefix-embedding adds what information enters the model at the start. These operate at different "levels," so they might complement each other.
What happens:
Intuition: This is more aggressive, because prefix-layer tuning already modifies representations throughout the model (like LoRA does), potentially creating redundancy.
On WikiSQL dataset:
On MultiNLI dataset:
Overall performance:
Why does this happen? The paper identifies the culprit: hyperparameter sensitivity
The success of LoRA+PE demonstrates that LoRA solves a different problem than prefix-embedding tuning. In linear algebra, two subspaces are orthogonal if they don't overlap. Here, the analogy is:
The failure of LoRA+PL suggests these methods are not orthogonal:
If we denote:
The effectiveness of a combination depends on how much the parameter spaces overlap:
LoRA+PE: The parameter spaces are largely disjoint
This is why WikiSQL shows additive improvements.
LoRA+PL: The parameter spaces have significant overlap (both modify representations throughout the network)
The optimization becomes harder, leading to subadditive gains.
F.1 Additional Experiments on GPT-2: We repeat our experiment on DART and WebNLG following the setup of Li & Liang (2021...
In this paper we use the measure $\phi(A, B, i, j) = \psi(U_A^i, U_B^j) = \frac{\|U_A^{i\top} U_B^j\|_F^2}{\min\{i,j\}}$...
H.1 Correlation Between LoRA Modules: See Figure 6 and Figure 7 for how the results presented in Figure 3 and Figure 4 g...