Towards Optimal Adapter Placement for Efficient Transfer Learning

Aleksandra I. Nowak, Otniel-Bogdan Mercea, Anurag Arnab, Jonas Pfeiffer, Yann Dauphin, Utku Evci

TL;DR

A small number of strategically placed adapters can match or exceed the performance of the common baseline of adding adapters in every block, opening a new avenue for research into optimal adapter placement strategies.

Abstract

Parameter-efficient transfer learning (PETL) aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters. Adapters, a popular approach in PETL, inject additional capacity into existing networks by incorporating low-rank projections, achieving performance comparable to full fine-tuning with significantly fewer parameters. This paper investigates the relationship between the placement of an adapter and its performance. We observe that adapter location within a network significantly impacts its effectiveness, and that the optimal placement is task-dependent. To exploit this observation, we introduce an extended search space of adapter connections, including long-range and recurrent adapters. We demonstrate that even randomly selected adapter placements from this expanded space yield improved results, and that high-performing placements often correlate with high gradient rank. Our findings reveal that a small number of strategically placed adapters can match or exceed the performance of the common baseline of adding adapters in every block, opening a new avenue for research into optimal adapter placement strategies.
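
To make the "low-rank projections" in the abstract concrete, the following is a minimal sketch of a bottleneck adapter in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the ReLU bottleneck, the zero-initialized up-projection, the omission of biases, and the sizes `d_model=16`, `rank=2` are all hypothetical choices made for this example.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_adapter(d_model, rank, seed=0):
    """Create a low-rank adapter as a (W_down, W_up) pair.

    Assumption: W_up starts at zero so the adapter is initially a no-op,
    a common convention that the source does not confirm for this paper.
    """
    rng = random.Random(seed)
    # Down-projection: d_model -> rank, small random values.
    W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rank)]
    # Up-projection: rank -> d_model, zero-initialized.
    W_up = [[0.0] * rank for _ in range(d_model)]
    return W_down, W_up


def adapter_delta(adapter, h):
    """Compute the adapter's additive update for a hidden vector h."""
    W_down, W_up = adapter
    z = [max(0.0, v) for v in matvec(W_down, h)]  # ReLU in the bottleneck
    return matvec(W_up, z)


def adapter_param_count(d_model, rank):
    """Trainable weights in this bias-free sketch: two rank-limited matrices."""
    return 2 * d_model * rank


d_model, rank = 16, 2  # illustrative sizes only
adapter = make_adapter(d_model, rank)
h = [1.0] * d_model
# The adapter output is added to the frozen network's hidden representation.
h_adapted = [a + b for a, b in zip(h, adapter_delta(adapter, h))]
```

With a hypothetical hidden size of 768 and rank 8, `adapter_param_count(768, 8)` gives 12,288 trainable weights, compared with 589,824 for a single dense 768 × 768 layer, which is the sense in which adapters add capacity with far fewer fine-tuned parameters.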

Paper Structure

This paper contains 45 sections, 6 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: The test accuracy of a single parallel adapter for different placements. The dashed horizontal lines mark the performance of the fully fine-tuned model (pink), the linear probe (black), and a setup with all 24 parallel adapters placed in every layer, after both the MHA and FFN modules (cyan). The accuracy of a single adapter depends on both the task and the selected placement.
  • Figure 2: (a) The visualization of the connectivity graph $G(V,E)$ for a Transformer encoder network with $L=12$ layers (resulting in $n=25$ nodes). Each node corresponds to a hidden representation in an encoder block (denoted by the dashed lines), either after the MHA, or the FFN module. The node with the index zero represents the input to the encoder. (b) The adjacency matrix of graph $G(V,E)$ with marked search spaces for adapter placements. (c) The visualization of each studied adapter type. The block $F$ corresponds either to an MHA or FFN module.
  • Figure 3: The test accuracy obtained for the single-adapter placement for various tasks. The y-axis (rows) represents the input node index $i$, while the x-axis (columns) corresponds to the output node index $j$. The vertical lines in the color bar indicate the performance of full fine-tuning (full FT), linear probe (LP), and parallel adapters (PA). A bright yellow line marks the performance of the best single adapter. The plots are normalized to the minimum and maximum performance of a single adapter for the given task. The three best-performing adapters are also marked by yellow blocks in the plot (see Appendix \ref{app:type_acc} for the top accuracy of each adapter type). Note that, due to the high computational cost, we subsample the adjacency matrix of all possible connections.
  • Figure 4: The Spearman correlation between the test accuracies of adapter locations across datasets, computed from the data in Figure \ref{fig:adapters_full}.
  • Figure 5: The test accuracy obtained for a given location versus the gradient rank of that location for the different datasets. We report Spearman's correlation coefficient computed for each dataset (in brackets).
  • ...and 9 more figures
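
The node indexing described in the Figure 2 caption can be sketched as follows. The node count (one input node plus two nodes per layer) follows the caption, and the count of 24 parallel placements matches the Figure 1 caption. Everything else is an assumption made for illustration: the function names are hypothetical, odd indices are assumed to follow an MHA module and even indices an FFN module, and the mapping from an edge $(i, j)$ to an adapter type is a guess at the taxonomy rather than the paper's stated definition.

```python
def num_nodes(num_layers):
    """One node for the encoder input plus two per layer (after MHA, after FFN)."""
    return 2 * num_layers + 1


def describe_node(k):
    """Map a node index to a readable location.

    Assumption: within each block MHA precedes FFN, so odd indices follow
    an MHA module and even (non-zero) indices follow an FFN module.
    """
    if k == 0:
        return "input"
    layer = (k + 1) // 2  # 1-based layer index
    module = "MHA" if k % 2 == 1 else "FFN"
    return f"layer {layer} after {module}"


def classify_edge(i, j):
    """Hypothetical adapter taxonomy for an edge from input node i to output node j."""
    if j == i + 1:
        return "parallel"    # spans exactly one MHA or FFN module
    if j == i:
        return "sequential"  # reads and writes the same representation
    if j > i + 1:
        return "long-range"  # skips over more than one module
    return "recurrent"       # feeds a later representation back to an earlier one


n = num_nodes(12)                                # L = 12 layers
parallel = [(i, i + 1) for i in range(n - 1)]    # one placement per module
```

Under these assumptions, `n` is 25 and `parallel` holds 24 edges, one around each MHA and FFN module, which corresponds to the all-parallel-adapters baseline in Figure 1.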