RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

Xiang Lin, Weixin Li, Shu Guo, Lihong Wang, Di Huang

TL;DR

RMAdapter addresses the challenge of adapting vision-language models in few-shot settings without losing zero-shot generalization by introducing a reconstruction-based dual-branch adapter. It combines a task-specific adaptation branch with a lightweight reconstruction branch that preserves general knowledge through local layer-wise reconstruction losses and a consistency constraint, with shared down-projection to maintain efficiency. Trained while freezing the base CLIP model, RMAdapter achieves state-of-the-art results across base-to-novel generalization, cross-dataset transfer, and domain generalization, without data augmentation or prompt redesign. The work demonstrates that a reconstruction objective plus selective architectural sharing can balance discriminability and generalization in multimodal transfer learning.

Abstract

Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios makes it difficult to balance task-specific adaptation against generalization in the resulting model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing the reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.

Paper Structure

This paper contains 20 sections, 18 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Conceptual illustration of our method. Existing adaptation methods rely on task-specific optimization objectives, which leads to the loss of generalizable knowledge (pink line). Our RMAdapter (green line) preserves generalizable knowledge through the introduction of a reconstruction loss (blue line). It guides the training trajectory toward the point between two optimal solution manifolds (green dot) while learning task-specific representations.
  • Figure 2: Structural evolution from the standard adapter and AutoEncoder to our proposed RMAdapter. Left: The internal structure of a general adapter module. The input $x$ undergoes a down projection $W_{down}$, a nonlinear activation, and an up projection $W_{up}$ to obtain $\text{Adapter}(x)$, which is then fused with the input $x$ through a residual connection. Middle: The general structure of an AutoEncoder. The input $x$ is passed through an encoder $g$ to obtain a latent representation $z$, which is then reconstructed by a decoder $f$ to produce $\hat{x}$. The model is optimized by minimizing the reconstruction loss, $\arg\min_{f,g} L(x,\hat{x})$. Right: The general structure of our RMAdapter. Inspired by the structural similarity between adapters and AutoEncoders, RMAdapter employs a dual-branch architecture that integrates an AutoEncoder branch to reconstruct the input $x$ and thereby preserve general knowledge, while the $W_{down}$ parameters are shared to further enhance the model’s performance.
  • Figure 3: The framework of our RMAdapter. RMAdapter optimizes only the additional adapters (colored parts), while the entire pre-trained CLIP model remains frozen. RMAdapter employs a dual-branch architecture consisting of: (1) an adaptation branch, $\text{RMAdapter}_{\text{base}}$, and (2) a reconstruction branch, $\text{RMAdapter}_{\text{rec}}$. Notably, the reconstruction loss is computed locally within each layer, without the need for layer-wise backpropagation or inter-layer transmission, making the computation highly efficient. As in previous methods, we fine-tune only the higher layers $k$ of each encoder to achieve a better balance between discriminability and generalization. The orange and green lines represent the adapted outputs, while the black lines indicate the original CLIP outputs.
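
The dual-branch structure described in the Figure 2 and Figure 3 captions can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class name `DualBranchAdapter`, the ReLU activation, the mean-squared-error reconstruction loss, the zero initialization of the up projection, and the name `w_rec` for the reconstruction decoder are all assumptions made for illustration. Only the shared down projection, the residual connection, and the per-layer (local) reconstruction loss come from the captions.

```python
import numpy as np


class DualBranchAdapter:
    """Hypothetical sketch of one dual-branch adapter layer (forward pass only)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down projection W_down, shared by both branches (as in Figure 2, right).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Up projection W_up of the adaptation branch. Zero init is an assumed,
        # common adapter convention: the layer starts as an identity mapping.
        self.w_up = np.zeros((bottleneck, dim))
        # Decoder of the reconstruction branch (the name w_rec is an assumption).
        self.w_rec = rng.normal(0.0, 0.02, size=(bottleneck, dim))

    def forward(self, x):
        # Shared latent representation; ReLU is an assumed nonlinearity.
        z = np.maximum(x @ self.w_down, 0.0)
        # Adaptation branch: adapter output fused with x via a residual connection.
        adapted = x + z @ self.w_up
        # Reconstruction branch: map the latent back to the input feature space.
        x_hat = z @ self.w_rec
        # Local reconstruction loss L(x, x_hat); MSE is an assumed choice.
        rec_loss = float(np.mean((x_hat - x) ** 2))
        return adapted, rec_loss


# Usage: a batch of 4 feature vectors of width 8, bottleneck width 2.
x = np.ones((4, 8))
adapter = DualBranchAdapter(dim=8, bottleneck=2)
adapted, rec_loss = adapter.forward(x)
print(adapted.shape, rec_loss >= 0.0)
```

Because the loss depends only on the layer's own input `x` and its reconstruction `x_hat`, it can be evaluated inside each layer without passing anything between layers, which is the locality property the Figure 3 caption highlights. In a training setup, the per-layer losses would be combined with the task loss and the consistency constraint; how they are weighted is not specified in this section.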