Advancing Parameter Efficiency in Fine-tuning via Representation Editing

Muling Wu; Wenhao Liu; Xiaohua Wang; Tianlong Li; Changze Lv; Zixuan Ling; Jianhao Zhu; Cenyuan Zhang; Xiaoqing Zheng; Xuanjing Huang

TL;DR

This paper introduces Representation Editing (RED), a parameter-efficient fine-tuning approach that freezes model weights and learns two vectors to edit layer representations, primarily within FFN sub-layers. RED substantially reduces trainable parameters (e.g., ~0.26M for 7B LLaMA-2, ~25,700× fewer than full fine-tuning, ~32× fewer than LoRA) while achieving competitive or superior performance across RoBERTa, GPT-2, T5, and LLaMA-2 on tasks like GLUE and E2E NLG. Extensive ablations show that both scaling and bias editing contribute to gains, with bias being particularly impactful, and expanding editing to additional representations can boost results with moderate parameter increases. The results suggest RED as a practical and scalable PEFT strategy for large-scale neural models, with potential applicability beyond NLP and to few-shot settings; the authors also provide open-source code for reproducibility.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) techniques have drawn significant attention due to their ability to yield competitive results while updating only a small fraction of a model's parameters. However, existing PEFT methods pose challenges in hyperparameter selection, such as choosing the rank for LoRA or Adapter, or specifying the length of soft prompts. To address these challenges, we propose a novel fine-tuning approach for neural models, named Representation EDiting (RED), which modifies the representations generated at some layers through the application of scaling and biasing operations. While existing PEFT methods still demonstrate over-parameterization that could potentially undermine the generalization ability acquired from pre-training, RED can substantially reduce the number of trainable parameters by a factor of 25,700 compared to full parameter fine-tuning and by a factor of 32 relative to LoRA. Remarkably, RED achieves results comparable or superior to both full parameter fine-tuning and other PEFT methods. Extensive experiments across various model architectures and scales, including RoBERTa, GPT-2, T5, and LLaMA-2, have demonstrated the effectiveness and efficiency of RED, thereby positioning it as a promising PEFT strategy for large-scale neural models.
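
The parameter figures quoted above can be sanity-checked with a short calculation. The sketch below is illustrative only: the layer count (32), hidden size (4096), and total parameter count (6,738,415,616) for LLaMA-2 7B are assumptions taken from the publicly known model configuration rather than from this summary, and it assumes one scaling vector and one bias vector per feed-forward output, each as long as the hidden size.

```python
# Back-of-the-envelope check of the parameter counts quoted in the abstract.
# Assumed (not stated in this summary): LLaMA-2 7B has 32 Transformer layers,
# a hidden size of 4096, and 6,738,415,616 parameters in total.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
TOTAL_PARAMS = 6_738_415_616

# RED as summarized here: one scaling vector and one bias vector per edited
# feed-forward output, each of length HIDDEN_SIZE (an assumption).
VECTORS_PER_LAYER = 2


def red_trainable_params(num_layers: int, hidden_size: int) -> int:
    """Count trainable scalars when every layer gets two hidden-size vectors."""
    return num_layers * VECTORS_PER_LAYER * hidden_size


red_params = red_trainable_params(NUM_LAYERS, HIDDEN_SIZE)
reduction_vs_full = TOTAL_PARAMS / red_params

print(f"RED trainable parameters: {red_params:,}")
print(f"Reduction vs. full fine-tuning: {reduction_vs_full:,.0f}x")
```

Under these assumptions the count comes to 262,144 trainable parameters (about 0.26M), and the ratio to the full model lands close to the factor of 25,700 reported in the abstract.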

Paper Structure (30 sections, 3 equations, 4 figures, 19 tables)

Introduction
Related Work
Method
Recap of PEFT Methods
Representation Editing
Experiments
Baselines
Results with RoBERTa
Results with GPT-2
Results with T5
Results with LLaMA-2
Ablation Study
Impact of Different Editing Operators
Impact of Editing Positions
Parameter Efficiency and Efficacy
...and 15 more sections

Figures (4)

Figure 1: Comparison of previous representative PEFT methods with the proposed RED. (a) LoRA incorporates learnable bottleneck-shaped modules (highlighted in orange) by integrating additional connections parallel to the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices of attention blocks, along with modifying the weights of these matrices in a low-rank fashion. Adapter, on the other hand, introduces learnable modules within similar structures (also highlighted in orange) by incorporating additional connections following both the attention and feed-forward sub-layers. (b) RED introduces two learnable vectors, $l_\text{scaling}$ and $l_\text{bias}$, to directly edit the representations (marked in green) generated by feed-forward sub-layers, which significantly reduces the number of parameters required for fine-tuning.
Figure 2: Performance scores achieved by RED and other PEFT methods on MT-Bench. Refer to Table \ref{tab:mt-details} and Appendix \ref{appendix:llama-2 results} for raw scores and additional details.
Figure 3: The model fine-tuned with RED generates a thorough, sequential guide that offers accurate details, facilitating comprehension even for novices. This guide encompasses elements such as preparation, threading, positioning, sewing techniques, and post-sewing cleanup, while also providing safety advice and promoting testing for secure attachment. In contrast, the LoRA-trained response inaccurately concentrates on buttonhole creation rather than button sewing, potentially leading to confusion for individuals seeking button attachment guidance. The response generated by the full-parameter trained model presents a simplified summary, but it lacks the in-depth explanation and precision of RED, rendering it less informative for those unfamiliar with the sewing process.
Figure 4: The model fine-tuned using RED generates a comprehensive and proactive strategy, addressing immediate issues, potential symptoms to monitor, and the significance of veterinary consultation. It offers an overarching safety evaluation of the Ranunculaceae family, indicating potentially toxic members and highlighting the necessity for professional assessment. This response strikes a balance between informative content and practical guidance, empowering pet owners to act in their pet's best interests, even in the absence of specific plant identification. In contrast, the responses produced by the models trained with full parameters and LoRA place a greater emphasis on collecting further information before offering advice, which could inadvertently postpone critical care in an emergent situation.
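
The editing operation in Figure 1(b) can be read as an element-wise transform of a feed-forward output $h$, roughly $h' = l_\text{scaling} \odot h + l_\text{bias}$. The following is a minimal, dependency-free sketch of that reading, not the authors' implementation: the class name `RedEdit`, the use of plain Python lists, and the identity initialization (scaling of ones, bias of zeros) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedEdit:
    """Hypothetical per-layer edit: out[i] = scaling[i] * h[i] + bias[i]."""

    hidden_size: int
    scaling: List[float] = field(default_factory=list)
    bias: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: start as the identity map so that, before any training,
        # the frozen model's representations pass through unchanged.
        if not self.scaling:
            self.scaling = [1.0] * self.hidden_size
        if not self.bias:
            self.bias = [0.0] * self.hidden_size

    def __call__(self, h: List[float]) -> List[float]:
        if len(h) != self.hidden_size:
            raise ValueError("representation length must equal hidden_size")
        return [s * x + b for s, x, b in zip(self.scaling, h, self.bias)]

    def num_trainable(self) -> int:
        # Only the two vectors would be trained; the base model stays frozen.
        return len(self.scaling) + len(self.bias)


edit = RedEdit(hidden_size=4)
h = [0.5, -1.0, 2.0, 0.0]

unchanged = edit(h)      # identity at initialization

edit.scaling[0] = 2.0    # stand-in for a learned scaling value
edit.bias[1] = 0.5       # stand-in for a learned bias value
edited = edit(h)

print(unchanged, edited, edit.num_trainable())
```

In a real model the same transform would be applied to tensors inside each edited sub-layer, with only the two vectors receiving gradients; the list-based version here just makes the arithmetic explicit.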
