Hadamard Adapter: An Extreme Parameter-Efficient Adapter Tuning Method for Pre-trained Language Models

Yuyan Chen, Qiang Fu, Ge Fan, Lun Du, Jian-Guang Lou, Shi Han, Dongmei Zhang, Zhixu Li, Yanghua Xiao

TL;DR

This work tackles the high parameter cost of fine-tuning large pre-trained language models (PLMs) by introducing the Hadamard adapter, a lightweight module inserted after the self-attention layer that applies the element-wise transformation $A' = W \odot A + b$ to the self-attention outputs. Through empirical analysis and GLUE-based experiments across multiple PLMs, the authors demonstrate that a two-stage procedure, first training the classifier and then freezing most PLM parameters while tuning only the Hadamard adapter and the normalization module, yields competitive performance while using only about $0.033\%$ of the parameters of full fine-tuning (potentially as low as $0.022\%$ after pruning). They provide evidence that a linear fit of the self-attention outputs suffices to approximate full fine-tuning, and they identify the bias terms and the normalization module as particularly impactful, which guides the design choices for extreme parameter efficiency. Exploratory analyses also point to the potential of adapters shared across tasks, suggesting a path toward further reductions in parameter footprint with multi-task shared components. Overall, the Hadamard adapter offers a practical, scalable approach to adapting large PLMs to downstream tasks with minimal storage and computation.

Abstract

In recent years, pre-trained language models (PLMs) have swept into various fields of artificial intelligence and achieved great success. However, most PLMs, such as T5 and GPT3, have a huge number of parameters; fine-tuning them is often expensive and time-consuming, and storing them takes up a lot of space. Therefore, it is necessary to adopt a parameter-efficient approach that reduces the number of parameters tuned during fine-tuning without compromising performance on downstream tasks. In this paper, we design a novel adapter that acts only on the self-attention outputs in PLMs. This adapter applies an element-wise linear transformation using the Hadamard product, hence the name Hadamard adapter, and requires the fewest parameters among previous parameter-efficient adapters. In addition, we summarize some tuning patterns of the Hadamard adapter that are shared across downstream tasks, which we expect to provide guidance for further parameter reduction with shared adapters in future studies. Experiments on the widely used GLUE benchmark with several SOTA PLMs show that the Hadamard adapter achieves competitive performance with only 0.033\% of the parameters of full fine-tuning, and that it has the fewest parameters compared with other adapters. Moreover, we find that the Hadamard adapter contains some redundant layers that can be removed, improving parameter efficiency further to only 0.022\% of the parameters.
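
To make the element-wise transformation $A' = W \odot A + b$ concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that $W$ and $b$ are vectors of the hidden size shared across all token positions, and that an identity initialization ($W = 1$, $b = 0$) is used so the adapter starts as a no-op; neither detail is stated in the text above. The function names `hadamard_adapter`, `identity_init`, and `adapter_param_count` are hypothetical, as are the example dimensions (hidden size 768, 12 layers), which are typical of a base-sized encoder and not taken from this page.

```python
from typing import List, Tuple

Vector = List[float]


def hadamard_adapter(attn_out: List[Vector], weight: Vector, bias: Vector) -> List[Vector]:
    """Apply A' = W (Hadamard product) A + b to every token vector.

    attn_out: self-attention output, one vector of length `hidden` per token.
    weight, bias: adapter parameters of length `hidden` (assumed to be
    shared across token positions; this is an illustrative assumption).
    """
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same length")
    adapted = []
    for token in attn_out:
        if len(token) != len(weight):
            raise ValueError("token vector length must match the adapter size")
        # Element-wise scale and shift: no matrix multiplication is involved.
        adapted.append([w * a + b for w, a, b in zip(weight, token, bias)])
    return adapted


def identity_init(hidden_size: int) -> Tuple[Vector, Vector]:
    """Assumed initialization: W = 1 and b = 0, so the adapter is a no-op at first."""
    return [1.0] * hidden_size, [0.0] * hidden_size


def adapter_param_count(hidden_size: int, num_layers: int) -> int:
    """One weight vector and one bias vector per layer."""
    return 2 * hidden_size * num_layers


if __name__ == "__main__":
    attn_out = [[1.0, -2.0, 0.5], [0.0, 4.0, 2.0]]  # 2 tokens, hidden size 3
    weight = [2.0, 0.5, 1.0]
    bias = [0.1, 0.0, -1.0]
    print(hadamard_adapter(attn_out, weight, bias))
    # Illustrative dimensions only (hidden size 768, 12 layers).
    print(adapter_param_count(768, 12))
```

Because the adapter only scales and shifts each hidden dimension, its parameter count grows linearly with the hidden size, whereas a dense projection of the same output would grow quadratically.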

Paper Structure (16 sections, 5 equations, 5 figures, 5 tables)

This paper contains 16 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The distribution of the norm of the self-attention outputs among all tasks before (a) and after fine-tuning (b), and the corresponding changes (c) in each layer.
  • Figure 2: The average values of each token in a sequence of all tasks (a), the characteristic value distribution among all tasks (b), and the average characteristic values of all tasks (c) based on full fine-tuning and different fitting functions, respectively, in each hidden layer.
  • Figure 3: The framework of the Hadamard adapter (A), and the process of the parameter-efficient adapter tuning method, including two parts: (a) Train the classifier; (b) Inject the Hadamard adapter for self-attention outputs and unfreeze the normalization module.
  • Figure 4: The influence of the number of unfrozen Hadamard adapter layers on performance, for the base (a) and large (b) model versions.
  • Figure 5: Per-layer analysis of each module in the Hadamard adapter, used to answer three questions. Question 1 corresponds to (a$_1$) and (a$_2$), Question 2 corresponds to (b$_1$) through (b$_4$), and Question 3 corresponds to (c$_1$) and (c$_2$).