Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Xinhao Yao; Hongjin Qian; Xiaolin Hu; Gengze Xu; Wei Liu; Jian Luan; Bin Wang; Yong Liu

TL;DR

This work reveals two attention-centered phenomena that govern fine-tuning efficiency in Transformer-based LLMs: (i) Unequal Importance of Attention Matrices, where tuning $W_v$ dominates and focusing on $W_q$ and $W_v$ can match or outperform full $W_q,W_k,W_v$ tuning while reducing parameter counts; and (ii) Attention Matrices with Customized Learning Rate, which shows that allocating a larger learning rate to $W_v$ accelerates convergence. The authors formalize a unified PEFT perspective, derive information-theoretic generalization bounds favoring the $W_q$+$W_v$ subset at fixed rank, and analyze learning-rate dynamics to justify a larger $\text{LR}_V$ relative to $\text{LR}_{QK}$. An actionable example demonstrates a storage- and time-efficient fine-tuning strategy (freeze $W_k$ and train $W_q$ and $W_v$ with distinct learning rates), validated on GLUE benchmarks with RoBERTa-base and Llama3.1-8b, achieving competitive performance with significantly fewer trainable parameters. Collectively, the results provide a theoretical foundation and practical guidance for designing lightweight, PEFT-based fine-tuning methods for Transformer architectures.
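To make the parameter saving concrete, the sketch below counts the trainable parameters of LoRA-style adapters on the attention matrices. It is an illustration under stated assumptions, not the paper's accounting: it assumes square $d \times d$ projections, the publicly documented RoBERTa-base sizes (hidden size 768, 12 layers), rank $r=8$, and it ignores any task head. The function names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one low-rank adapter.

    A rank-r update to a (d_in x d_out) matrix is the product of a
    (d_in x r) factor and an (r x d_out) factor.
    """
    return rank * (d_in + d_out)


def attention_adapter_params(adapted, d_model: int, num_layers: int, rank: int) -> int:
    """Total adapter parameters when `adapted` lists the tuned matrices.

    Assumes every attention projection is (d_model x d_model) and that
    the same rank is used for each adapted matrix in each layer.
    """
    per_matrix = lora_param_count(d_model, d_model, rank)
    return len(adapted) * per_matrix * num_layers


# Assumed RoBERTa-base sizes: hidden size 768, 12 layers; rank 8.
qkv_params = attention_adapter_params(("q", "k", "v"), d_model=768, num_layers=12, rank=8)
qv_params = attention_adapter_params(("q", "v"), d_model=768, num_layers=12, rank=8)

print(qkv_params)              # 442368
print(qv_params)               # 294912
print(qv_params / qkv_params)  # two thirds of the Q, K, V budget
```

At a fixed rank, dropping $W_k$ removes one third of the attention adapter parameters regardless of model size, since each adapted matrix contributes the same count.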

Abstract

Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two remarkable phenomena related to the attention mechanism during the fine-tuning of LLMs (where $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$ denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed "Unequal Importance of Attention Matrices", highlights the impact of fine-tuning different weight matrices. It shows that optimizing the $\mathbf{W}_v$ matrix yields significantly better performance than optimizing the $\mathbf{W}_k$ matrix. Fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally efficient while delivering results comparable to, or even better than, fine-tuning all three matrices ($\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$). The second phenomenon, "Attention Matrices with Customized Learning Rate Lead to Better Convergence", emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, a higher learning rate for the $\mathbf{W}_v$ matrix than for $\mathbf{W}_q$ and $\mathbf{W}_k$ accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving algorithms in LLM fine-tuning.
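The strategy described above (skip $\mathbf{W}_k$, train $\mathbf{W}_q$ and $\mathbf{W}_v$ with different learning rates) can be sketched as a small helper that sorts named parameters into two groups. This is a minimal sketch, not the authors' implementation: the helper name, the name tags used for matching, and the ratio of 8 in the usage lines are all hypothetical choices for illustration. The returned dictionaries use the `params`/`lr` keys that PyTorch optimizers accept as per-group options, but the helper itself needs no dependency.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumed naming conventions: "query"/"value" (BERT-style modules) and
# "q_proj"/"v_proj" (Llama-style modules). Adjust for other models.
Q_TAGS = ("query", "q_proj")
V_TAGS = ("value", "v_proj")


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    lr_qk: float,
    lr_ratio: float,
) -> List[Dict[str, Any]]:
    """Group query and value parameters with separate learning rates.

    The value group uses lr_qk * lr_ratio. Everything else, including
    the key projection, is left out of the groups and so is not updated
    by an optimizer built from them.
    """
    q_params, v_params = [], []
    for name, param in named_params:
        parts = name.split(".")  # match whole name components only
        if any(tag in parts for tag in V_TAGS):
            v_params.append(param)
        elif any(tag in parts for tag in Q_TAGS):
            q_params.append(param)
    return [
        {"params": q_params, "lr": lr_qk},
        {"params": v_params, "lr": lr_qk * lr_ratio},
    ]


# Usage with placeholder strings standing in for parameter tensors.
demo = [
    ("layer.0.attention.self.query.weight", "q0"),
    ("layer.0.attention.self.key.weight", "k0"),
    ("layer.0.attention.self.value.weight", "v0"),
    ("classifier.dense.weight", "head"),
]
groups = build_param_groups(demo, lr_qk=1e-4, lr_ratio=8.0)
print([len(g["params"]) for g in groups])  # [1, 1]
```

With a real model, one would pass `model.named_parameters()` and hand the result to an optimizer such as `torch.optim.AdamW(groups)`. Disabling gradients for the excluded parameters, and adding a group for any task head that should still be trained, are left out of this sketch.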

Paper Structure (29 sections, 4 theorems, 19 equations, 5 figures, 3 tables)

Introduction
Preliminaries and Background
Advantages and Generalization Analysis
Empirical Advantages
Information-Theoretic Generalization Bounds
Convergence Analysis in Optimization
An Insight into Inefficient Learning
Convergence Analysis for Learning Rate
An Example of Improving Fine-tuning
Conclusion and Limitation
Omitted Proofs and Additional Results
The connection between Prefix tuning and LoRA.
Proof of Theorem 1
Initialization Discussion
Gamma Function
...and 14 more sections

Key Result

Theorem 1

Consider the algorithms of Definition 1. Assume the loss $\ell(\mathbf{W}, Z)$ is $R$-sub-Gaussian under $(\Delta \mathbf{W}, Z) \sim P_{\Delta \mathbf{W} \mid \mathbf{W}} \times \mu$. Then (see the appendix section "Proof of Theorem 1" for a proof), where $\mathbf{W}_q^i,\mathbf{W}_k^i,\mathbf{W}_v^i\in \mathbb{R}^{d_{in}\times d_{out}}$.

Figures (5)

Figure 1: Test accuracy of RoBERTa-base fine-tuning, evaluated over 3 epochs for MNLI, QQP, and QNLI and over 6 epochs for SST-2, with sequence length $T=128$ and half precision (FP16). The LoRA hyperparameters are set to $\alpha=r=8$. All reported values are averages over 3 random seeds. Red highlights (1) the best overall accuracy and (2) the values where $\eta_V/\eta_{QK}=1$. For better visualization, accuracies below a fixed threshold are set to that threshold.
Figure 2: Left: Test accuracy of Llama3.1-8b fine-tuning, evaluated over 800 steps for MNLI. As in Figure 1, key values are shown in red. Right: Training loss over 800 steps for MNLI fine-tuning on Llama3.1-8b, comparing the two optimal learning-rate settings $(\eta_{QK},\eta_V)$ from the left panel: (1) with $\eta_V=\eta_{QK}$ and (2) with $\eta_V \gg \eta_{QK}$.
Figure 3: Training loss of RoBERTa-base fine-tuning. Other settings are the same as in Figure 1.
Figure 4: Training loss of Llama3.1-8b fine-tuning. Other settings are the same as in Figure 2.
Figure 5: A brief diagram outlining how our theoretical insights guide the experiments.

Theorems & Definitions (11)

Remark 1
Definition 1: Fine-tuning algorithms
Theorem 1: Generalization bounds on adapting $\mathbf{W}_q\&\mathbf{W}_v$ and/or $\mathbf{W}_k$
Remark 2: Discussion of the advantages
Remark 3
Theorem 2: Efficient fine-tuning in attention mechanism (Informal)
Remark 4
Lemma 1
Lemma 2
proof
...and 1 more
