LoFiT: Localized Fine-tuning on LLM Representations

Fangcong Yin; Xi Ye; Greg Durrett

TL;DR

This work introduces a framework called Localized Fine-Tuning on LLM Representations (LoFiT), which identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads.

Abstract

Recent work in interpretability shows that large language models (LLMs) can be adapted for new tasks in a learning-free way: it is possible to intervene on LLM representations to elicit desired behaviors for alignment. For instance, adding certain bias vectors to the outputs of certain attention heads is reported to boost the truthfulness of models. In this work, we show that localized fine-tuning serves as an effective alternative to such representation intervention methods. We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT), which identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads. LoFiT localizes to a sparse set of heads (3%-10%) and learns the offset vectors from limited training data, comparable to the settings used for representation intervention. For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention. We also find that the localization step is important: selecting a task-specific set of attention heads can lead to higher performance than intervening on heads selected for a different task. Finally, across 7 tasks we study, LoFiT achieves comparable performance to other parameter-efficient fine-tuning methods such as LoRA, despite modifying 20x-200x fewer parameters than these methods.
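
The core operation described in the abstract, adding a trained offset vector to the output of each selected attention head while leaving all other heads untouched, can be illustrated with a minimal sketch. This is not the authors' implementation: the `(layer, head)` indexing, the function names, and the toy values are assumptions made for illustration, and plain Python lists stand in for tensors. The parameter count at the end assumes a model with 32 layers of 32 heads and a head dimension of 128, a shape often quoted for Llama 2-7B but not stated in this text.

```python
from typing import Dict, List, Tuple

Vector = List[float]
HeadIndex = Tuple[int, int]  # (layer, head): a hypothetical indexing scheme


def apply_head_offsets(
    head_outputs: Dict[HeadIndex, Vector],
    offsets: Dict[HeadIndex, Vector],
) -> Dict[HeadIndex, Vector]:
    """Add an offset vector to the output of every selected head.

    Heads without an entry in `offsets` pass through unchanged, so the
    intervention stays localized to a sparse subset of heads.
    """
    updated: Dict[HeadIndex, Vector] = {}
    for index, output in head_outputs.items():
        offset = offsets.get(index)
        if offset is None:
            updated[index] = list(output)
            continue
        if len(offset) != len(output):
            raise ValueError(f"offset for head {index} has the wrong dimension")
        updated[index] = [o + b for o, b in zip(output, offset)]
    return updated


def count_offset_parameters(num_selected_heads: int, head_dim: int) -> int:
    """Number of trainable values when each selected head gets one offset vector."""
    return num_selected_heads * head_dim


# Toy example: 2 layers x 2 heads, head dimension 3, one selected head.
toy_outputs = {
    (0, 0): [1.0, 2.0, 3.0],
    (0, 1): [0.0, 0.0, 0.0],
    (1, 0): [1.0, 1.0, 1.0],
    (1, 1): [-1.0, 0.5, 2.0],
}
toy_offsets = {(0, 1): [0.5, -0.5, 1.0]}
toy_result = apply_head_offsets(toy_outputs, toy_offsets)

# Assumed shape: 32 * 32 = 1024 heads, so 10% is roughly 102 heads of dimension 128.
assumed_param_count = count_offset_parameters(num_selected_heads=102, head_dim=128)
```

Under the assumed shape, tuning about 10% of the heads amounts to roughly 13,000 trainable values, which gives a sense of how small the learned intervention is relative to the frozen model.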

Paper Structure (59 sections, 3 equations, 6 figures, 8 tables)

Introduction
Background: Localized Representation Intervention
Preliminaries: Transformer Architecture
Localized Representation Intervention (on Attention Heads)
Instantiations in the literature
LoFiT: Localized Representation Fine-Tuning
Experimental Setup
Representation Intervention Baselines
Models and Training
Results: Effectiveness of Localization
Importance of LoFiT Heads
Results
Task Specificity of Localized Interventions
Granularity of Localization
Comparison with PEFT Methods
...and 44 more sections

Figures (6)

Figure 1: LoFiT methodology. LoFiT freezes all pre-trained weights of a transformer language model and uses two sets of lightweight parameters to modify the LLM representations in two steps: Attention Head Selection and Bias Tuning. Only the tuned biases are used in the final model.
Figure 2: Test accuracy of using LoFiT heads learned from a different task. Colors reflect relative accuracy with respect to using same-task heads, with same-task heads (diagonals) representing $100\%$ relative accuracy. Different-task results with $*$ are significantly lower than the same-task result at the significance level of $0.05$ with a paired bootstrap test and results with $+$ are significantly lower at the level of $0.1$. For TruthfulQA, we report MC1 accuracy. Across models, task-specific heads consistently outperform different-task heads for TruthfulQA and MQuAKE.
Figure 3: Distribution of LoFiT heads over layers for different tasks. Across tasks, LoFiT heads are often located in different parts of the model, and layer selection differs between Llama2 and Gemma.
Figure 4: LoFiT performance using different numbers of training examples $n$ on CLUTRR and MQuAKE with Llama 2-7B. For LoFiT, we tune $10\%$ of the attention heads. Results are averaged over two runs. In the low-data settings ($n \leq 100$), LoFiT is more data-efficient than LoRA and RED. For $n \geq 300$, LoFiT remains comparable to LoRA and RED while using fewer parameters.
Figure 5: The effects of the percentage of attention heads $K$ used for LoFiT Bias Tuning on LoFiT performance. Results are averaged over two runs. The test accuracy increases with $K$ when $K < 10\%$ and plateaus when $K$ reaches $10\%-20\%$.
...and 1 more figure
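
The Attention Head Selection step named in Figure 1, together with the percentage $K$ of heads varied in Figure 5, can be sketched as a top-K ranking over per-head scores. The captions do not say how heads are scored, so the sketch below assumes that each head has a small learned selection vector and that heads are ranked by the L2 norm of that vector. The scoring rule, the function names, and the toy values are all assumptions for illustration, not the authors' procedure.

```python
import math
from typing import Dict, List, Tuple

HeadIndex = Tuple[int, int]  # (layer, head): a hypothetical indexing scheme


def l2_norm(vector: List[float]) -> float:
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def select_top_heads(
    selection_params: Dict[HeadIndex, List[float]],
    fraction: float,
) -> List[HeadIndex]:
    """Return the given fraction of heads with the largest parameter norm.

    ASSUMPTION: ranking by norm is an illustrative choice; the source only
    states that a head-selection step precedes bias tuning.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(selection_params)))
    # Sort by descending norm; ties are broken by head index for determinism.
    ranked = sorted(
        selection_params,
        key=lambda index: (-l2_norm(selection_params[index]), index),
    )
    return sorted(ranked[:k])


# Toy example: 4 heads, keep 50% of them.
toy_selection_params = {
    (0, 0): [0.1, 0.0],
    (0, 1): [2.0, 1.0],
    (1, 0): [0.0, 0.3],
    (1, 1): [1.5, -1.5],
}
selected_heads = select_top_heads(toy_selection_params, fraction=0.5)
```

In a two-step pipeline of this kind, only the heads returned by the selection step would receive trainable offset vectors in the subsequent Bias Tuning step, and the selection parameters themselves would be discarded afterwards, consistent with the note in Figure 1 that only the tuned biases are used in the final model.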
