Adapters Strike Back

Jan-Martin O. Steitz, Stefan Roth

TL;DR

Adapters can adapt vision transformers effectively with minimal parameter overhead, but prior results were inconsistent because of implementation choices. This work systematically analyzes adapter positions, inner structure, initializations, and data normalization, and introduces Adapter+, which combines a learnable channel-wise scaling with Post-Adapter placement. Adapter+ achieves state-of-the-art average accuracy on VTAB (77.6%) without per-task hyperparameter optimization and reaches 90.7% on FGVC with a small parameter budget, outperforming more complex methods. The findings offer practical guidance for robust, scalable transfer learning of ViTs across diverse visual tasks and pretraining regimes.

Abstract

Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete, improved adapter architecture, called Adapter+, that not only outperforms previous adapter implementations but surpasses a number of other, more complex adaptation mechanisms in several challenging settings. Despite this, our suggested adapter is highly robust and, unlike previous work, requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark, even without a per-task hyperparameter optimization.

Paper Structure (24 sections, 12 equations, 3 figures, 14 tables)

Figures (3)

  • Figure 1: Parameter-accuracy characteristics of adaptation methods on the VTAB [Zhai:2020:LSR] test sets. We report original results and re-evaluations ($\circlearrowright$) after a complete training schedule with suitable data normalization. Our Adapter+ clearly has the best parameter-accuracy trade-off. The vertical, dashed line shows the minimal possible number of tunable parameters, reached when only the classifiers are trained, i.e., with linear probing (61% accuracy).
  • Figure 2: Average accuracy for VTAB subgroups on the test sets. For methods marked with $\circlearrowright$, we report results of our re-evaluation after a complete training schedule with suitable data normalization to ensure a fair comparison. Adapter+ is evaluated with rank $r\!\in\![1..32]$.
  • Figure 3: Illustrations of (a) the inner structure of an adapter with feed-forward layers (FF), activation layer (Act), and optional layer normalization (LN) and scaling, and (b)–(d) different possible adapter positions for connecting the adapter to the FFN section of the transformer layer. Modules with trainable parameters are shown in red and frozen modules in blue.
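
The adapter structure described in the Figure 3 caption, together with the channel-wise scaling and Post-Adapter placement mentioned in the TL;DR, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Adapter`, `post_adapter_block`, and `frozen_ffn` are hypothetical, the optional layer normalization inside the adapter is omitted, the adapter is assumed to sit on its own residual branch after the FFN skip connection, and the initialization (small random down-projection, zero up-projection) is an arbitrary choice that makes the adapter start as a no-op.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    """Layer normalization over the channel axis, without learnable affine terms."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class Adapter:
    """Bottleneck adapter: FF down -> activation -> FF up -> channel-wise scaling."""

    def __init__(self, dim, rank, rng):
        # Assumption: illustrative initialization, not taken from the paper.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, rank))
        self.b_down = np.zeros(rank)
        self.w_up = np.zeros((rank, dim))  # zero init: adapter output starts at 0
        self.b_up = np.zeros(dim)
        self.scale = np.ones(dim)          # one learnable scale per channel

    def __call__(self, x):
        hidden = gelu(x @ self.w_down + self.b_down)
        return self.scale * (hidden @ self.w_up + self.b_up)

    def num_params(self):
        params = (self.w_down, self.b_down, self.w_up, self.b_up, self.scale)
        return sum(p.size for p in params)


def post_adapter_block(x, ffn, adapter):
    """FFN section of a transformer layer with the adapter in the Post position."""
    y = x + ffn(layer_norm(x))  # frozen FFN with its skip connection
    # Assumption: the adapter acts on the summed output and is added residually.
    return y + adapter(y)


# Hypothetical frozen FFN with random weights, standing in for pretrained ones.
rng = np.random.default_rng(0)
dim, rank = 8, 2
w1 = rng.normal(0.0, 0.02, size=(dim, 4 * dim))
w2 = rng.normal(0.0, 0.02, size=(4 * dim, dim))
frozen_ffn = lambda t: gelu(t @ w1) @ w2

adapter = Adapter(dim, rank, rng)
tokens = rng.normal(size=(3, dim))  # 3 tokens with `dim` channels each
out = post_adapter_block(tokens, frozen_ffn, adapter)
print(out.shape, adapter.num_params())
```

In this sketch, an adapter with hidden dimension `dim` and rank `rank` adds `2 * dim * rank + rank + 2 * dim` tunable parameters per layer (two projections with biases, plus the scaling vector), which is 50 for the toy sizes above. Because the up-projection starts at zero, the block initially reproduces the frozen model's output exactly.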