Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi Zhou, Nan Du, Vincent Y. Zhao, Yuexin Wu, Bo Li, Yu Zhang, Ming-Wei Chang

TL;DR

CoDA is a parameter-efficient transfer learning approach that adds lightweight adapters and a learned router to enable conditional computation in pretrained models. By running the heavy pretrained computation on only a small subset of tokens per layer, CoDA achieves 2x to 8x inference speedups over state-of-the-art adapter approaches with moderate to no accuracy loss, while keeping the pretrained parameters intact and adding only a small number of new ones. The method is demonstrated across NLP, vision, and speech tasks, with ablations showing the importance of learned routing and that a short pretraining phase starting from a dense model is sufficient. CoDA is compatible with other PETL techniques such as LoRA, offering a practical path to scalable deployment of large pretrained models.

Abstract

We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.

Paper Structure

This paper contains 40 sections, 15 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: Comparison between different ways to use pretrained Transformer models, including (1) standard finetuning (left), where all parameters are tunable and computation is dense, (2) standard adapters (center), where a small set of new tunable parameters is added while the computation remains dense, and (3) CoDA (right), where the computation is sparsely activated.
  • Figure 2: CoDA significantly reduces the inference time compared to the Parallel Adapter approach (He et al., 2021), while still maintaining parameter efficiency.
  • Figure 3: Illustration of a single CoDA layer with a parallel adapter. $k$ tokens are selected and processed by the frozen pretrained Transformer layer, and all tokens are processed by the fast adapter layer.
  • Figure 4: Finetuning accuracy (y-axis) as a function of CoDA pretraining steps (x-axis). We show results using 0, 20K, 50K, and 100K pretraining steps, for reduction factors $r=3$ and $r=5$. CoDA requires as few as 20K steps to obtain competitive finetuning accuracy.
  • Figure 5: Comparison of CoDA and Parallel Adapter on 6 language tasks. We report results on the test set of XSum and on the development set of the other tasks. $\dagger$ indicates results taken from He et al. (2021), and referenced results in brackets correspond to using 2M adapter parameters. Note that our Parallel Adapter numbers are stronger because our pretrained Transformer backbone uses more parameters than the model used in He et al. (2021).
  • ...and 7 more figures
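
The layer described in the Figure 3 caption can be sketched in a few lines of plain Python. This is only an illustrative toy, not the authors' implementation: the function name `coda_layer`, the linear router score, the hard top-k selection, the summation used to combine the two branches, and the relation `k = ceil(n / r)` between sequence length `n` and reduction factor `r` are all assumptions made here for illustration. The toy `heavy` and `adapter` functions stand in for the frozen pretrained Transformer layer and the fast adapter layer.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coda_layer(tokens, router_w, heavy_fn, adapter_fn, r):
    """Toy conditional layer: every token goes through adapter_fn,
    only the k highest-scoring tokens also go through heavy_fn."""
    n = len(tokens)
    k = max(1, math.ceil(n / r))  # assumption: k = ceil(n / r)
    # Hypothetical router: one linear score per token.
    scores = [dot(router_w, x) for x in tokens]
    # Hard top-k selection (assumption); sorted() is stable, so ties
    # are broken in favour of the earlier token.
    selected = set(sorted(range(n), key=lambda i: -scores[i])[:k])
    out = []
    for i, x in enumerate(tokens):
        # Cheap path: residual connection plus adapter, applied to all tokens.
        y = [a + b for a, b in zip(x, adapter_fn(x))]
        if i in selected:
            # Expensive path: only the selected tokens pay for heavy_fn.
            y = [a + b for a, b in zip(y, heavy_fn(x))]
        out.append(y)
    return out, sorted(selected)

# Toy stand-ins for the frozen pretrained layer and the adapter.
heavy = lambda x: [10.0 * v for v in x]
adapter = lambda x: [0.1 * v for v in x]

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5], [2.0, 2.0], [0.0, 0.1]]
router_w = [1.0, 1.0]

out, selected = coda_layer(tokens, router_w, heavy, adapter, r=3)
print(selected)  # [2, 4]
```

With `n = 6` tokens and `r = 3`, only two tokens take the expensive path, so the cost of `heavy_fn` in this toy scales with `k` rather than `n`. That is the intuition behind the speed and accuracy trade-off controlled by the reduction factor.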