Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning

Tuc Nguyen, Thai Le

TL;DR

The paper investigates whether weight-space mixing of domain-specific adapters generalizes to unseen, in-domain examples, a question not thoroughly explored in prior work. It conducts a large-scale in-domain evaluation across 13 diverse datasets using multiple adapter methods, and exhaustively enumerates adapter mixtures to quantify generalization and adversarial robustness. A central finding is a robust negative correlation between the fraction of weight sign difference (FSD) among mixed adapters and predictive performance. This motivates FSD-guided strategies, including Greedy Adapter Mixing and FSD-based magnitude pruning, that maintain performance at high sparsity. The results yield practical guidance for deploying parameter-efficient adapters in real-world scenarios and suggest pruning as a natural by-product that reduces sign conflicts while preserving accuracy.

Abstract

Several parameter-efficient fine-tuning methods based on adapters have been proposed as a streamlined approach to incorporating not just one kind of specialized knowledge into existing Pre-Trained Language Models (PLMs), but several at once. Recent works such as AdapterSoup propose to mix only a selected subset of domain-specific adapters during inference, via model weight averaging, to optimize performance on novel, unseen domains with excellent computational efficiency. However, the generalizability of this emerging weight-space adapter mixing mechanism on unseen, in-domain examples remains unexplored. Thus, in this study, we conduct a comprehensive analysis to elucidate the generalizability of domain-specific adapter mixtures in in-domain evaluation. We also investigate the inner workings of the mixture of domain-specific adapters by analyzing their weight signs, yielding a critical analysis of the negative correlation between their fraction of weight sign difference and their mixtures' generalizability.

Paper Structure

This paper contains 31 sections, 2 equations, 15 figures, 17 tables, 4 algorithms.

Figures (15)

  • Figure 1: Mixing the adapter weights across various tasks may result in the important weights of individual tasks nullifying each other, thereby yielding a merged mixture that loses important information.
  • Figure 2: (a) Datasets' semantic similarity via cosine similarity among centroids of Universal Sentence Encoder (USE) (Cer et al., 2018) embeddings of 1K randomly sampled documents from each dataset. (b) Topic distributions via Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
  • Figure 3: Accuracy of RoBERTa with Pfeiffer adapters (Pfeiffer et al., 2020) in each target domain. The x-axis denotes the number of mixed adapters.
  • Figure 4: Heatmap visualization of the Fraction of Sign Difference (in %) of Pfeiffer adapters (Pfeiffer et al., 2020) trained on 13 domain-specific tasks with RoBERTa.
  • Figure 5: FSD when mixing two ($k{=}2$) adapters. Sky-blue bars show the FSD (left y-axis). Dashed blue lines denote the accuracy achieved by a standalone adapter. Solid red lines illustrate the variations in accuracy after mixing. See the Appendix for results in other tasks.
  • ...and 10 more figures
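
The fraction of sign difference reported in Figures 4 and 5, and the cancellation effect sketched in Figure 1, can be illustrated with a small Python example. This is a minimal sketch, not the authors' implementation: it assumes FSD is the share of parameter positions at which the mixed adapters do not all carry the same sign, and it treats each adapter as a flat list of weights. The function names `fraction_sign_difference` and `average_weights` and the toy weight values are hypothetical.

```python
from typing import List, Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fraction_sign_difference(adapters: Sequence[Sequence[float]]) -> float:
    """Share of positions where the adapters disagree in sign.

    Assumption: a position counts as a conflict when the weights at that
    position do not all share one sign (an exact zero is its own sign here).
    """
    n = len(adapters[0])
    if any(len(a) != n for a in adapters):
        raise ValueError("all adapters must have the same number of weights")
    conflicts = sum(
        1 for column in zip(*adapters) if len({sign(w) for w in column}) > 1
    )
    return conflicts / n


def average_weights(adapters: Sequence[Sequence[float]]) -> List[float]:
    """Uniform weight-space averaging of the adapters."""
    k = len(adapters)
    return [sum(column) / k for column in zip(*adapters)]


# Toy weights (hypothetical): positions 0 and 3 have opposite signs.
adapter_a = [0.8, -0.5, 0.3, 0.9]
adapter_b = [-0.7, -0.4, 0.2, -0.8]

fsd = fraction_sign_difference([adapter_a, adapter_b])
mixed = average_weights([adapter_a, adapter_b])

print(f"FSD: {fsd:.2f}")
print("mixed:", [round(w, 3) for w in mixed])
```

In this toy case, the two positions with conflicting signs average to values near zero even though both adapters hold large-magnitude weights there, while the positions with matching signs keep their magnitude. That is the nullification effect Figure 1 describes, under the assumptions stated above.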