AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Salim Khazem

Abstract

Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized. This guarantees that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
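
The zero-initialization claim in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the class name `LowRankAdapter`, its method names, and the initialization scale 0.02 are hypothetical, and the sketch assumes a purely linear bottleneck, omitting any nonlinearity, bias, or normalization the actual module may use. It shows only that a residual adapter with a zero up-projection returns its input unchanged at initialization, and that such a module adds 2 x dim x rank weights.

```python
import random


class LowRankAdapter:
    """Hypothetical residual bottleneck: h -> h + alpha * W_up (W_down h)."""

    def __init__(self, dim, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x dim): small random values (scale is an assumption).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(rank)]
        # Up-projection (dim x rank): all zeros, so the adapter branch outputs 0.
        self.w_up = [[0.0] * rank for _ in range(dim)]
        self.alpha = alpha

    def delta(self, h):
        """Adapter branch only: alpha * W_up (W_down h)."""
        z = [sum(w * x for w, x in zip(row, h)) for row in self.w_down]
        return [self.alpha * sum(w * v for w, v in zip(row, z))
                for row in self.w_up]

    def __call__(self, h):
        """Residual connection: input plus adapter branch."""
        return [x + d for x, d in zip(h, self.delta(h))]

    def num_params(self):
        """Weights in the two projections (no biases in this sketch)."""
        return 2 * len(self.w_up) * len(self.w_down)


adapter = LowRankAdapter(dim=8, rank=2)
h = [0.5 * i - 1.0 for i in range(8)]

# At initialization the adapted path reproduces its input exactly.
assert adapter(h) == h
```

Zeroing only the up-projection keeps the down-projection random, so gradients can still reach the up-projection on the first step while the network output is unchanged at step zero.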

Paper Structure (29 sections, 2 theorems, 9 equations, 7 figures, 6 tables)

Introduction
Related Work
Method
Preliminaries
Residual Adapter Module
Placement.
Zero-Initialization for Stable Optimization
Trainable Parameter Count
Training Objective and Protocol
Comparison Regimes
Theoretical Analysis
Setup and Assumptions
Approximation Bound
Proof sketch.
Diminishing Returns with Rank
...and 14 more sections

Key Result

theorem 1

Under Assumption (ass:lowrank), let $\Delta_r^\star$ denote the best rank-$r$ approximation of $\Delta^\star$ (obtained by truncated SVD at rank $r$). There exist adapter parameters $\{W^{\mathrm{up}}, W^{\mathrm{down}}\}$ such that the adapter $A$ satisfies, for any $h$ with $\|h\|_2 \le B$, Moreover, if the downstream loss is $L_\ell$-Lipschitz in the logits and the classifier head is $L_g$-Li

Figures (7)

Figure 1: AdapterTune architecture. (Left) Trainable residual adapters (orange) are inserted into the strictly frozen Vision Transformer backbone (blue). (Right) The adapter uses a low-rank bottleneck where the up-projection is zero-initialized. This guarantees an initial zero output ($A_\ell(h_\ell) = 0$), acting as an exact identity mapping to prevent early-epoch optimization drift.
Figure 2: Per-dataset accuracy comparison. Each row corresponds to one dataset. Gray circles: Head-Only; blue squares: Full Fine-Tune; red stars: AdapterTune. Connecting lines show the performance gap bridged by each method. AdapterTune (red stars) reaches or surpasses full fine-tuning on most datasets, while using only 0.92% of its parameters.
Figure 3: Accuracy versus trainable parameter count (Pareto frontier). AdapterTune (red stars) achieves comparable or higher accuracy than full fine-tuning (blue squares) at 1-2 orders of magnitude fewer trainable parameters, demonstrating a clearly favourable position on the accuracy-efficiency frontier.
Figure 4: (Left) Rank sweep across all core datasets and backbone scales. The diminishing-returns elbow (predicted by Corollary 1) appears consistently across every dataset-backbone pair, not just CIFAR-10/ViT-S. Accuracy gains from $r\!=\!8\!\to\!32$ uniformly exceed gains from $r\!=\!32\!\to\!64$, validating the $\mathcal{O}(r^{1/2-p})$ decay law broadly. (Right) Rank $\times$ adapter scale ($\alpha$) joint sensitivity on CIFAR-10/ViT-S. Accuracy is robust across the full $r\!\in\![8,64]$ range for $\alpha\!\le\!1$. Only $\alpha\!=\!2$ at low rank causes a visible drop, confirming $\alpha\!=\!1$ as a safe default that need not be tuned.
Figure 5: Backbone scaling trends. (Left) Average gain of AdapterTune over Head-Only across all datasets as backbone parameter count grows from DeiT-T/16 (5M) to ViT-B/16 (86M). The gains are consistent across backbone scales, with larger backbones showing slightly higher gains on fine-grained tasks. (Right) Average gain of AdapterTune over Full Fine-Tuning: the adapter advantage over full fine-tuning increases with backbone size, attributed to the stronger implicit regularization of the low-rank constraint as model capacity grows.
...and 2 more figures
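
The $\mathcal{O}(r^{1/2-p})$ decay quoted in the Figure 4 caption can be made concrete with a small numerical sketch. It rests on an assumption not stated on this page: that the singular values of the task shift follow a power law $\sigma_i = i^{-p}$ with $p > 1/2$. Under that assumption, the Frobenius norm of what a rank-$r$ truncated SVD leaves out is $\sqrt{\sum_{i>r} i^{-2p}}$, which scales as $r^{1/2-p}$. The function name `tail_residual`, the choice $p = 1$, and the spectrum length `n` are illustrative, and the quantity computed is an approximation residual, not accuracy.

```python
import math


def tail_residual(p, rank, n=10000):
    """Frobenius norm of the spectrum tail beyond `rank`,
    assuming singular values sigma_i = i**(-p) for i = 1..n."""
    return math.sqrt(sum(i ** (-2.0 * p) for i in range(rank + 1, n + 1)))


p = 1.0  # assumed power-law exponent (must exceed 1/2 for the tail to converge)
residuals = {r: tail_residual(p, r) for r in (8, 16, 32, 64)}

# Reduction in residual when rank grows 8 -> 32 versus 32 -> 64.
gain_8_32 = residuals[8] - residuals[32]
gain_32_64 = residuals[32] - residuals[64]

for r, value in residuals.items():
    print(f"rank={r:2d}  residual={value:.4f}")
print(f"gain 8->32: {gain_8_32:.4f}   gain 32->64: {gain_32_64:.4f}")
```

In this toy spectrum the residual keeps shrinking as rank grows, but each step buys less than the one before, which is the elbow shape the rank sweep reports.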

Theorems & Definitions (3)

theorem 1: Approximation by rank-$r$ adapters
corollary 1: Diminishing returns
proof
