Linearization Explains Fine-Tuning in Large Language Models

Zahra Rahimi Afzal; Tara Esmaeilbeig; Mojtaba Soltanalian; Mesrob I. Ohannessian

TL;DR

The paper tackles why parameter-efficient fine-tuning (PEFT) works by analyzing training dynamics through linearization around a pretrained model and the Neural Tangent Kernel (NTK). It introduces an explicit proximal regularization toward the pretrained parameters, making fine-tuning behave like NTK regression in a lazy regime. The authors derive parameter-distance bounds, NTK-spectral bounds, and spectral-perturbation results for adding trainable layers, showing that the NTK spectrum at initialization predicts adaptation performance and can guide layer selection. Empirical validation on RoBERTa-base with LoRA across GLUE, IMDb, and Yelp supports the theory and suggests practical, low-overhead strategies for designing PEFT with improved efficiency in large language models.
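The step from proximal regularization to NTK regression can be written out as a short worked derivation. This is a sketch under stated assumptions rather than the paper's exact statement: the $\tfrac{1}{2}$ and $\tfrac{\lambda}{2}$ scalings, the stacked Jacobian $\mathbf{J}$, the residual vector $\mathbf{y}-f_{\boldsymbol{\theta}_0}(\mathbf{X})$, and the displacement $\boldsymbol{\delta}$ are notation introduced here for illustration.

$$
\bar{f}_{\boldsymbol{\theta}}(\mathbf{x}) = f_{\boldsymbol{\theta}_0}(\mathbf{x}) + \left\langle \nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}_0}(\mathbf{x}),\, \boldsymbol{\theta}-\boldsymbol{\theta}_0 \right\rangle,
\qquad
\boldsymbol{\delta} = \boldsymbol{\theta}-\boldsymbol{\theta}_0 .
$$

Stacking the gradients of $n$ training samples as the rows of $\mathbf{J}\in\mathbb{R}^{n\times p}$, the regularized squared-loss problem for the linearized model is

$$
\min_{\boldsymbol{\delta}} \; \tfrac{1}{2}\left\| f_{\boldsymbol{\theta}_0}(\mathbf{X}) + \mathbf{J}\boldsymbol{\delta} - \mathbf{y} \right\|_2^2 + \tfrac{\lambda}{2}\left\|\boldsymbol{\delta}\right\|_2^2 .
$$

Setting the gradient to zero gives $(\mathbf{J}^\top\mathbf{J}+\lambda\mathbf{I})\,\boldsymbol{\delta} = \mathbf{J}^\top\left(\mathbf{y}-f_{\boldsymbol{\theta}_0}(\mathbf{X})\right)$, and the push-through identity $(\mathbf{J}^\top\mathbf{J}+\lambda\mathbf{I})^{-1}\mathbf{J}^\top = \mathbf{J}^\top(\mathbf{J}\mathbf{J}^\top+\lambda\mathbf{I})^{-1}$ yields

$$
\boldsymbol{\delta}^\star = \mathbf{J}^\top\left(\mathbf{K}+\lambda\mathbf{I}\right)^{-1}\left(\mathbf{y}-f_{\boldsymbol{\theta}_0}(\mathbf{X})\right),
\qquad
\mathbf{K} = \mathbf{J}\mathbf{J}^\top .
$$

Here $\mathbf{K}$ is the empirical NTK Gram matrix at $\boldsymbol{\theta}_0$. It is positive semi-definite, so for $\lambda>0$ the matrix $\mathbf{K}+\lambda\mathbf{I}$ is positive definite, which is the kernel ridge regression form that the summary refers to.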

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using a Euclidean distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully non-linear and the linearized fine-tuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning. We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs.
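To make the role of proximity concrete, the following minimal sketch compares a toy non-linear model with its first-order linearization around a reference point. Everything in it is an assumption for illustration: the three-parameter model `f`, the reference parameters `theta0`, and the perturbation `direction` are hypothetical and do not come from the paper. It only shows the general fact that the linearization error shrinks roughly quadratically as the parameters stay closer to the starting point.

```python
import math

def f(theta, x):
    """Toy non-linear model f_theta(x) = a * tanh(w * x + b), with theta = (w, b, a)."""
    w, b, a = theta
    return a * math.tanh(w * x + b)

def grad_f(theta, x):
    """Analytic gradient of f with respect to (w, b, a)."""
    w, b, a = theta
    t = math.tanh(w * x + b)
    s = 1.0 - t * t  # derivative of tanh at (w * x + b)
    return (a * s * x, a * s, t)

def f_lin(theta, theta0, x):
    """First-order expansion: f_theta0(x) + <grad f_theta0(x), theta - theta0>."""
    g = grad_f(theta0, x)
    return f(theta0, x) + sum(gi * (ti - t0) for gi, ti, t0 in zip(g, theta, theta0))

def lin_error(theta0, direction, step, x):
    """Absolute gap between the model and its linearization at theta0 + step * direction."""
    theta = tuple(t0 + step * d for t0, d in zip(theta0, direction))
    return abs(f(theta, x) - f_lin(theta, theta0, x))

# Hypothetical reference parameters, perturbation direction, and input.
theta0 = (0.5, -0.2, 1.5)
direction = (1.0, -1.0, 0.5)
x = 0.7

err_small = lin_error(theta0, direction, 0.01, x)  # parameters stay close to theta0
err_large = lin_error(theta0, direction, 0.1, x)   # parameters move 10x further

print(f"step 0.01: {err_small:.3e}")
print(f"step 0.10: {err_large:.3e}")
print(f"ratio    : {err_large / err_small:.1f}")
```

A tenfold increase in distance from `theta0` raises the gap by about two orders of magnitude in this toy case, which is the second-order behaviour expected of a first-order expansion.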

Paper Structure (24 sections, 11 theorems, 89 equations, 3 figures, 9 tables, 2 algorithms)

Introduction
Problem Formulation
Proximity to the Pretrained Model Promotes Linearity
Fine-Tuning Meets Neural Tangent Kernel Regression
Spectral Perturbation of Layers
Experiments
Model, Datasets and Optimizer
Linearization of Fine-Tuning
NTK Evaluation
Conclusion and Limitations
Definitions and Lemmas
Proof of Theorem \ref{thm::1}
Proof of Theorem \ref{thm::theta_bound}
Proof of Theorem \ref{thm::diff-f-bar-f}
Proof of Theorem \ref{thm::final}
...and 9 more sections

Key Result

Theorem 1

Under the squared loss, for any $t >0$, if $\lambda >0$ and $\nabla_{\boldsymbol{\theta}} \widetilde{\mathcal{R}}(\boldsymbol{\theta}_t) \left(\boldsymbol{\theta}_{t} - \boldsymbol{\theta}_0\right) \geq 0$, then [...]. Moreover, if $\nabla_{\boldsymbol{\theta}} \widetilde{\mathcal{R}}(\boldsymbol{\theta}_t) \left(\boldsymbol{\theta}_{t} - \boldsymbol{\theta}_0\right) < 0$, then $\lambda=0$ is a sufficie[...]

Figures (3)

Figure 1: The NTK defines a linear function space $\mathcal{H}$ tangent to the non-linear function space $\mathcal{F}$ defined by the model. Regularized fine-tuning in the lazy regime is close to kernel regression on the tangent space. $f_{\theta^\star}(\mathbf{x})$ is the fine-tuned model obtained by empirical risk minimization. If fine-tuning remains in the linearized regime, then after $T$ steps of training $f_{\theta^\star}(\mathbf{x})\approx f_{\boldsymbol{\theta}_{0}}(\mathbf{x})+\left\langle\nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}_{0}}(\mathbf{x}), \bar{\boldsymbol{\theta}}_{T}-\boldsymbol{\theta}_{0}\right\rangle$ is a good approximation.
Figure 2: (a)-(b) Illustrate the positive correlation between the convergence rate of optimization steps of LoRA over $10$ epochs and $\kappa(\mathbf{K}+\sigma \mathbf{I} )$ of NTK at initialization. $\{\mathbf{W}_q,\mathbf{W}_v\}$ of layers $\{0,5,11\}$ are fine-tuned. (c) Illustrates the negative correlation between evaluation accuracy after 10 epochs of training and the condition number of NTK. LoRA with $r=8$ is used to fine-tune $\{\mathbf{W}_k\}$ of the layers $\{0,5,11\}$.
Figure 3: Empirical risk ratio $\log\left(\frac{\mathcal{R}( \boldsymbol{\theta} \cup \hat{\boldsymbol{\theta}})}{\mathcal{R}(\boldsymbol{\theta})}\right)$ and maximum eigenvalue ratio $\log \left(\frac{ \lambda_{\text{max}}(\mathbf{K}+{\mathbf{S}}+\sigma \mathbf{I})}{\lambda_{\text{max}}(\mathbf{K}+\sigma \mathbf{I})}\right)$ are used to evaluate the impact of candidate layers. Here, $\boldsymbol{\theta}$ is fixed as the weights $\{\mathbf{W}_k\}$ of layer $\{0\}$, while $\hat{\boldsymbol{\theta}}$ represents the candidate layers. The horizontal axis represents the combination of layer $\{0\}$ and different candidate layers.
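The two spectral quantities named in the captions of Figures 2 and 3, $\kappa(\mathbf{K}+\sigma\mathbf{I})$ and $\log\left(\lambda_{\text{max}}(\mathbf{K}+\mathbf{S}+\sigma\mathbf{I})/\lambda_{\text{max}}(\mathbf{K}+\sigma\mathbf{I})\right)$, can be sketched in a few lines. This is a dependency-free illustration, not the authors' evaluation code: it assumes $\mathbf{K}=\mathbf{J}\mathbf{J}^\top$ is built from per-sample gradients of the already selected parameters and $\mathbf{S}=\hat{\mathbf{J}}\hat{\mathbf{J}}^\top$ from those of the candidate layers (the NTK of a union of parameter groups is the sum of the per-group kernels). The tiny Jacobians `J` and `J_hat` are made-up numbers, and power iteration stands in for the eigenvalue solver of a linear algebra library.

```python
import math
import random

def gram(J):
    """Gram matrix J J^T; row i of J is the parameter gradient for sample i."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in J] for ri in J]

def add(A, B):
    """Entrywise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def shifted(A, sigma):
    """Return A + sigma * I."""
    return [[a + (sigma if i == j else 0.0) for j, a in enumerate(row)]
            for i, row in enumerate(A)]

def lambda_max(A, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix by power iteration."""
    rng = random.Random(0)  # fixed seed keeps the sketch deterministic
    v = [rng.gauss(0.0, 1.0) for _ in A]
    lam = 0.0
    for _ in range(iters):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        # Rayleigh quotient v^T A v for the current unit vector
        lam = sum(vi * sum(a * x for a, x in zip(row, v)) for vi, row in zip(v, A))
    return lam

def lambda_min(A, iters=500):
    """Smallest eigenvalue of symmetric PSD A via power iteration on lambda_max * I - A."""
    top = lambda_max(A, iters)
    B = shifted([[-a for a in row] for row in A], top)
    return top - lambda_max(B, iters)

def condition_number(K, sigma):
    """kappa(K + sigma I) = lambda_max / lambda_min."""
    Ks = shifted(K, sigma)
    return lambda_max(Ks) / lambda_min(Ks)

def log_lmax_ratio(K, S, sigma):
    """log of lambda_max(K + S + sigma I) / lambda_max(K + sigma I)."""
    return math.log(lambda_max(shifted(add(K, S), sigma)) / lambda_max(shifted(K, sigma)))

# Made-up Jacobians: 2 samples, 2 selected parameters, 1 candidate parameter.
J = [[2.0, 0.0], [0.0, 1.0]]
J_hat = [[1.0], [0.0]]
K, S = gram(J), gram(J_hat)

print("kappa(K + sigma I):", condition_number(K, sigma=1.0))
print("log lambda_max ratio:", log_lmax_ratio(K, S, sigma=1.0))
```

In a layer-selection loop one would compute `log_lmax_ratio` once per candidate set at initialization and compare the values across candidates, which requires gradients but no training steps.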

Theorems & Definitions (21)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
Theorem 5
proof
...and 11 more
