AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Salim Khazem

Abstract

Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminating early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
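A minimal sketch of the zero-initialized residual bottleneck described in the abstract, written in NumPy for illustration. The class name `LowRankAdapter`, the ReLU nonlinearity, the bias terms, the small Gaussian initialization of the down-projection, and the example shapes are assumptions made for this sketch, not details confirmed by the paper. Only the zero-initialized up-projection, the residual form, and the scale $\alpha$ come from the text.

```python
import numpy as np


class LowRankAdapter:
    """Residual bottleneck: h + alpha * up(act(down(h))) (illustrative sketch)."""

    def __init__(self, dim, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection: small random init (assumed; not specified in the text).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, rank))
        self.b_down = np.zeros(rank)
        # Up-projection: zero-initialized, so the adapter branch outputs 0 at start.
        self.w_up = np.zeros((rank, dim))
        self.b_up = np.zeros(dim)
        self.alpha = alpha

    def branch(self, h):
        # ReLU is an assumed nonlinearity for this sketch.
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)
        return self.alpha * (z @ self.w_up + self.b_up)

    def __call__(self, h):
        # Residual connection around the bottleneck.
        return h + self.branch(h)

    def num_params(self):
        return (self.w_down.size + self.b_down.size
                + self.w_up.size + self.b_up.size)


# Hypothetical token batch: (batch, tokens, width).
tokens = np.random.default_rng(1).normal(size=(4, 197, 384))
adapter = LowRankAdapter(dim=384, rank=16)
out = adapter(tokens)

# At initialization the adapted output equals the input exactly.
print(np.array_equal(out, tokens))
print(adapter.num_params())
```

Because the up-projection starts at zero, the adapter branch contributes nothing before the first update and the block output equals its input, which is the start-at-the-pretrained-function property the abstract claims.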

Paper Structure (29 sections, 2 theorems, 9 equations, 7 figures, 6 tables)

Key Result

theorem 1

Under Assumption (ass:lowrank), let $\Delta_r^\star$ denote the best rank-$r$ approximation of $\Delta^\star$ (obtained by truncated SVD at rank $r$). There exist adapter parameters $\{W^{\mathrm{up}}, W^{\mathrm{down}}\}$ such that the adapter $A$ satisfies, for any $h$ with $\|h\|_2 \le B$, Moreover, if the downstream loss is $L_\ell$-Lipschitz in the logits and the classifier head is $L_g$-Li
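The construction behind a statement of this kind can be illustrated numerically. The sketch below does not reproduce the theorem's bound; it only checks the standard Eckart-Young fact that the truncated SVD at rank $r$ has spectral-norm error equal to the $(r+1)$-th singular value, and that the two SVD factors can serve as $W^{\mathrm{up}}$ and $W^{\mathrm{down}}$ of a purely linear adapter. The matrix `delta_star`, its power-law spectrum, the sizes `d`, `r`, `B`, and the linearity of the adapter are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, B = 64, 8, 1.0  # hypothetical width, rank, and input-norm bound

# Hypothetical task shift with polynomially decaying singular values.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
sigma = 1.0 / np.arange(1, d + 1) ** 1.5
delta_star = U @ np.diag(sigma) @ V.T

# Truncated SVD at rank r, split into two low-rank factors.
u, s, vt = np.linalg.svd(delta_star)
w_up = u[:, :r] * s[:r]   # shape (d, r)
w_down = vt[:r, :]        # shape (r, d)
delta_r = w_up @ w_down

# Eckart-Young: the spectral-norm error is the (r+1)-th singular value.
spec_err = np.linalg.norm(delta_star - delta_r, ord=2)

# For any h with ||h||_2 <= B, the residual is at most spec_err * B.
h = rng.normal(size=d)
h *= B / np.linalg.norm(h)
resid = np.linalg.norm(delta_star @ h - w_up @ (w_down @ h))

print(np.isclose(spec_err, s[r]), resid <= spec_err * B + 1e-12)
```

Splitting the singular values into the up factor is one arbitrary choice; any factorization of $\Delta_r^\star$ into a $d \times r$ and an $r \times d$ matrix gives the same linear map.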

Figures (7)

  • Figure 1: AdapterTune architecture. (Left) Trainable residual adapters (orange) are inserted into the strictly frozen Vision Transformer backbone (blue). (Right) The adapter uses a low-rank bottleneck where the up-projection is zero-initialized. This guarantees an initial zero output ($A_\ell(h_\ell) = 0$), acting as an exact identity mapping to prevent early-epoch optimization drift.
  • Figure 2: Per-dataset accuracy comparison. Each row corresponds to one dataset. Gray circles: Head-Only; blue squares: Full Fine-Tune; red stars: AdapterTune. Connecting lines show the performance gap bridged by each method. AdapterTune (red stars) reaches or surpasses full fine-tuning on most datasets, while using only 0.92% of its parameters.
  • Figure 3: Accuracy versus trainable parameter count (Pareto frontier). AdapterTune (red stars) achieves comparable or higher accuracy than full fine-tuning (blue squares) at 1-2 orders of magnitude fewer trainable parameters, demonstrating a clearly favourable position on the accuracy-efficiency frontier.
  • Figure 4: (Left) Rank sweep across all core datasets and backbone scales. The diminishing-returns elbow (predicted by Corollary 1) appears consistently across every dataset-backbone pair, not just CIFAR-10/ViT-S. Accuracy gains from $r\!=\!8\!\to\!32$ uniformly exceed gains from $r\!=\!32\!\to\!64$, validating the $\mathcal{O}(r^{1/2-p})$ decay law broadly. (Right) Rank $\times$ adapter scale ($\alpha$) joint sensitivity on CIFAR-10/ViT-S. Accuracy is robust across the full $r\!\in\![8,64]$ range for $\alpha\!\le\!1$. Only $\alpha\!=\!2$ at low rank causes a visible drop, confirming $\alpha\!=\!1$ as a safe default that need not be tuned.
  • Figure 5: Backbone scaling trends. (Left) Average gain of AdapterTune over Head-Only across all datasets as backbone parameter count grows from DeiT-T/16 (5M) to ViT-B/16 (86M). The gains are consistent across backbone scales, with larger backbones showing slightly higher gains on fine-grained tasks. (Right) Average gain of AdapterTune over Full Fine-Tuning: the adapter advantage over full fine-tuning increases with backbone size, attributed to the stronger implicit regularization of the low-rank constraint as model capacity grows.
  • ...and 2 more figures
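
One way to see where a rate of the form $\mathcal{O}(r^{1/2-p})$, as quoted in the Figure 4 caption, can come from is the short calculation below. It assumes, hypothetically, that the singular values of $\Delta^\star$ satisfy $\sigma_k \le C k^{-p}$ with $p > 1/2$ and that the approximation error is measured in Frobenius norm. Neither assumption is stated in this summary, so this is a sketch under those assumptions rather than the paper's proof.

```latex
\begin{align*}
\|\Delta^\star - \Delta_r^\star\|_F^2
  &= \sum_{k>r} \sigma_k^2
   \le C^2 \sum_{k>r} k^{-2p}
   \le C^2 \int_r^\infty x^{-2p}\,dx
   = \frac{C^2}{2p-1}\, r^{1-2p}, \\
\|\Delta^\star - \Delta_r^\star\|_F
  &\le \frac{C}{\sqrt{2p-1}}\, r^{1/2-p}.
\end{align*}
```

Under this bound the residual shrinks as $r$ grows, but its derivative in $r$ scales as $r^{-1/2-p}$, so each additional unit of rank buys less than the previous one. This is consistent with the elbow shape described in the caption.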

Theorems & Definitions (3)

  • theorem 1: Approximation by rank-$r$ adapters
  • corollary 1: Diminishing returns
  • proof