The Trade-off between Performance, Efficiency, and Fairness in Adapter Modules for Text Classification

Minh Duc Bui, Katharina von der Wense

TL;DR

This paper evaluates adapter modules for text classification along three dimensions at once: performance, efficiency, and fairness. It empirically compares full finetuning against two adapter methods (Adapters and LoRA) across three datasets (Jigsaw, HateXplain, BIOS) and four base LMs, finding that adapter modules largely match full finetuning in accuracy while substantially reducing training time. Fairness effects, however, are mixed and depend strongly on how biased the fully finetuned baseline already is: when that baseline is highly biased, adapter modules can amplify the bias for certain groups. The work advocates case-by-case fairness assessment and notes limitations such as its restriction to text classification and its selection of models, with practical implications for deploying adapter-based methods in trustworthy NLP.

Abstract

Current natural language processing (NLP) research tends to focus on only one or, less frequently, two dimensions (e.g., performance, privacy, fairness, or efficiency) at a time, which may lead to suboptimal conclusions and often overlooks the broader goal of achieving trustworthy NLP. Work on adapter modules (Houlsby et al., 2019; Hu et al., 2021) focuses on improving performance and efficiency, with no investigation of unintended consequences on other aspects such as fairness. To address this gap, we conduct experiments on three text classification datasets by either (1) finetuning all parameters or (2) using adapter modules. Regarding performance and efficiency, we confirm prior findings that the accuracy of adapter-enhanced models is roughly on par with that of fully finetuned models, while training time is substantially reduced. Regarding fairness, we show that adapter modules have mixed effects on fairness across sensitive groups. Further investigation reveals that, when the standard finetuned model exhibits limited biases, adapter modules typically do not introduce extra bias. On the other hand, when the finetuned model exhibits increased bias, the impact of adapter modules on bias becomes more unpredictable, introducing the risk of significantly magnifying these biases for certain groups. Our findings highlight the need for a case-by-case evaluation rather than a one-size-fits-all judgment.
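
One commonly cited reason adapter methods are cheaper to train is that only a small number of parameters receive gradient updates. The sketch below counts trainable parameters for a single bottleneck adapter (down-projection, up-projection, and their biases) and for a single LoRA update (two low-rank matrices). It is a back-of-the-envelope illustration, not the paper's configuration: the hidden size of 768, the LoRA rank of 8, and both function names are assumptions chosen for the example.

```python
def bottleneck_adapter_params(hidden_size: int, reduction_factor: int) -> int:
    """Trainable parameters of one bottleneck adapter (weights plus biases).

    Assumes the bottleneck width is hidden_size // reduction_factor.
    """
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck    # down-projection W and b
    up = bottleneck * hidden_size + hidden_size     # up-projection W and b
    return down + up


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA update B @ A on a d_out x d_in weight."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden = 768  # assumed hidden size, for illustration only
    dense = hidden * hidden  # one full projection matrix, without bias
    print(f"full {hidden}x{hidden} matrix: {dense} parameters")
    for factor in (2, 16, 64):
        count = bottleneck_adapter_params(hidden, factor)
        print(f"adapter, reduction factor {factor}: {count} parameters")
    print(f"LoRA, rank 8: {lora_params(hidden, hidden, 8)} parameters")
```

A larger reduction factor gives a narrower bottleneck and therefore fewer trainable parameters. How that translates into wall-clock training time depends on the model and hardware, so these counts are only a rough proxy.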

Paper Structure (24 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: We display our main results on the Jigsaw, HateXplain, and BIOS datasets. We plot the difference to the base variant. The color of the plane indicates an improvement (green) or degradation (red). Exact numerical values with standard deviation can be found in the Appendix, see Table \ref{table:results} and Table \ref{table:bios_results}.
  • Figure 2: Variance increases with higher bias levels. Boxplots depict the fairness differences between the base module and adapter modules, grouped by the level of group-level bias inherent in the base model. The color of the plane indicates an improvement (green) or degradation (red), while no color signifies no clear direction.
  • Figure 3: Balanced accuracy and equalized odds metrics for BERT+Adapters, RoBERTa+Adapters, and GPT-2+Adapters with different reduction factors {2, 16, 64}.
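
The two metrics named in the Figure 3 caption can be sketched in a few lines of plain Python for the binary case. Balanced accuracy is the mean of the true positive rate and the true negative rate. For equalized odds, the sketch reports the largest gap in true positive rate or false positive rate between any two groups; this is one common formulation and an assumption here, since the paper's exact definition is not given in this excerpt. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tpr, fpr = rates(y_true, y_pred)
    return 0.5 * (tpr + (1.0 - fpr))


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group gap in TPR or FPR (assumed formulation)."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    per_group = [rates(ts, ps) for ts, ps in by_group.values()]
    tprs = [r[0] for r in per_group]
    fprs = [r[1] for r in per_group]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data: two hypothetical groups "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(balanced_accuracy(y_true, y_pred))
print(equalized_odds_gap(y_true, y_pred, groups))
```

A gap of 0 would mean every group has the same true positive and false positive rates; larger values indicate that errors are distributed unevenly across groups.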