HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation

Geyuan Zhang, Xiaofei Zhou, Chuheng Chen

TL;DR

This work tackles the high computational cost of fine-tuning large pre-trained language models by introducing a direct Updated Transformation (UT) paradigm, which preserves a strong correlation between the original and updated weights. Building on UT, the Hadamard Updated Transformation (HUT) updates the weight matrices in transformers using a Hadamard product with two low-rank matrices, yielding a richer yet more efficient parameter update. In experiments on RoBERTa-large (GLUE) and GPT-2 (E2E NLG), the authors show that HUT attains competitive or state-of-the-art performance while significantly reducing training FLOPs and adding zero inference latency. The approach offers a principled, computation-efficient alternative to conventional PEFT methods and highlights the practical impact of maintaining original–updated parameter correlations during fine-tuning.

Abstract

Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP. However, fine-tuning all parameters becomes impractical due to the rapidly increasing size of model parameters. To address this, Parameter Efficient Fine-Tuning (PEFT) methods update only a subset of parameters. Most PEFT methods, such as LoRA, use incremental updates, which involve adding learned weight matrix increments to the original parameters. Although effective, these methods face limitations in capturing complex parameter dynamics and do not maintain a strong correlation between the original and updated parameters. To overcome these challenges, we propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters. This approach ensures that the correlation between the original and updated parameters is preserved, leveraging the semantic features learned during pre-training. Building on this paradigm, we present the Hadamard Updated Transformation (HUT) method. HUT efficiently updates the original weight matrix using the Hadamard transformation with two low-rank matrices, offering a more expressive and flexible update mechanism. This allows HUT to capture richer parameter features through functional transformations, reducing computational complexity while maintaining or improving model quality. Theoretical analysis and extensive experiments on RoBERTa and GPT-2 validate the effectiveness of HUT. Results show that HUT performs on par with or better than other PEFT methods in terms of model quality, while significantly reducing computational complexity.
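
To make the contrast in the abstract concrete, the sketch below compares an incremental update (the LoRA-style form `W0 + B @ A`) with one possible Hadamard-style transformation update. The exact HUT formula is not given in this section, so the form `W0 * (1 + M_A @ M_B)` and the names `hadamard_update`, `lora_update`, `m_a`, and `m_b` are assumptions chosen for illustration, not the authors' definition.

```python
import numpy as np


def lora_update(w0, b, a):
    """Incremental update: add a low-rank increment B @ A to W0."""
    return w0 + b @ a


def hadamard_update(w0, m_a, m_b):
    """Assumed transformation update (illustrative, not the paper's formula):
    scale W0 element-wise by (1 + M_A @ M_B), where M_A @ M_B is low rank."""
    return w0 * (1.0 + m_a @ m_b)


rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2

w0 = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
m_a = np.zeros((d_out, rank))         # zero init: the update starts as the identity
m_b = rng.normal(size=(rank, d_in))

w_new = hadamard_update(w0, m_a, m_b)

# With M_A = 0 the transformed weight equals the original weight.
print(np.allclose(w_new, w0))

# After training, the transformed weight can be computed once and stored,
# so a forward pass costs the same as with the original weight.
x = rng.normal(size=(3, d_in))
h = x @ w_new.T
print(h.shape)
```

In this assumed form every updated entry is a rescaled copy of the corresponding original entry, which is one simple way an update can stay tied to the pre-trained weights; an additive increment has no such constraint.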

Paper Structure

This paper contains 29 sections, 11 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Parameter updating procedure through Incremental Update and our Transformation Update. Most existing PEFT methods learn an incremental update by adding $\Delta W$ to the original weight matrix $W_0$, while we propose a direct update method that uses an update transformation to obtain $W_{new}$.
  • Figure 2: (a) Our proposed HUT maintains a strong correlation between $W_0$ and $U'(W)$, so that the learned $U'(W)$ can leverage the semantic features learned during pre-training. (b) The design of the HUT module.
  • Figure 3: Average scores on the GLUE benchmark for RoBERTa with different PEFT methods. The x-axis shows the number of GFLOPs, which indicates computational complexity, and the y-axis shows the average score.
  • Figure 4: Visualization of some results. The shades of red indicate the degree of emphasis that the fine-tuned model places on different words.