Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Jihwan Park, Taehoon Song, Sanghyeok Lee, Miso Choi, Hyunwoo J. Kim

TL;DR

The paper tackles the cost and fragility of adapting large vision-language models by proposing TransMiter, a lightweight, forward-only adapter that extracts adaptation knowledge from a weaker, fine-tuned model in an unsupervised manner and transfers it to a stronger target model without backpropagation. It introduces a logit-space adapter with auxiliary class expansion and a basis-change step based on Orthogonal Procrustes, enabling a closed-form, efficient transfer that preserves pre-trained knowledge while improving generalization. Experimental results across 11 datasets show TransMiter consistently surpasses zero-shot and base fine-tuning baselines, with minimal inference overhead; when combined with a small amount of labeled data (TransMiter+), it can rival or exceed supervised fine-tuning while remaining computationally lightweight. The work provides a scalable, practical approach to weak-to-strong generalization in rapidly evolving VLMs, with broad applicability to cross-model transfers and potential extensions to downstream tasks beyond visual recognition.

Abstract

Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from 'weaker' models to efficiently enhance 'stronger' ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models 'without backpropagation'. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an 'unsupervised' manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.

Paper Structure

This paper contains 35 sections, 15 equations, 5 figures, 21 tables.

Figures (5)

  • Figure 1: Comparison of adaptation knowledge transfer methods. Performance is averaged over 11 visual recognition tasks in a base-to-novel setting. TransMiter (ours) outperforms other adaptation transfer approaches, including Prompt Transfer [prompt_trans] and EFT [Emulator], while maintaining inference speed nearly identical to that of a zero-shot model (gray), which serves as the upper bound on speed. With a small amount of labeled data, TransMiter+ (ours+) surpasses its supervised counterpart [promptsrc], i.e., Fine-tuning (red).
  • Figure 2: Overall pipeline. (a) Adaptation Knowledge Extraction. Given the pre-trained ($\theta_\text{pt-s}$) and fine-tuned ($\theta_\text{ft-s}$) weak VLMs, TransMiter captures the adaptation knowledge $\delta_\text{s}$ by minimizing the distance between the refined logits $\hat{z}_\text{s}$ and the fine-tuned weak VLM logits $z_\text{ft-s}$. The adapter takes the zero-shot model logits as input and incorporates both task classes $C_\text{task}$ and auxiliary classes $C_\text{aux}$. (b) Adaptation Knowledge Transfer. Once the strong VLM $\theta_\text{pt-t}$ is available, the mapping matrix $\hat{W}$ is computed in closed form to align the input features of the weak ($H_\text{s}$) and strong ($H_\text{t}$) VLMs, and the original transition matrix $W_\text{s}$ is replaced with $W_\text{t}=\hat{W}^\intercal W_\text{s}$. (c) Model Enhancement. During inference with the TransMiter-equipped target VLM $\theta_\text{t}^*$, the pre-trained target VLM logits $z_\text{pt-t}$ are passed through the adapter, resulting in enhanced predictions. Because TransMiter offers a strong initial point, it can subsequently be fine-tuned with labeled data to maximize its capability.
  • Figure 3: Regularization weight in basis change.
  • Figure 4: Adaptation knowledge analysis.
  • Figure 14: Auxiliary class sampling strategy.
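
The basis-change step in Figure 2(b) can be illustrated with a minimal NumPy sketch of the classic Orthogonal Procrustes solution. This is an illustration under stated assumptions, not the authors' implementation: it assumes $H_\text{s}$ and $H_\text{t}$ are row-wise feature matrices of the same shape, that $\hat{W}$ is the orthogonal matrix minimizing $\lVert H_\text{s}\hat{W} - H_\text{t}\rVert_F$, and it omits the regularization weight mentioned in Figure 3. The function name `orthogonal_procrustes` and the synthetic data are hypothetical.

```python
import numpy as np


def orthogonal_procrustes(H_s: np.ndarray, H_t: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W_hat minimizing ||H_s @ W_hat - H_t||_F.

    Assumption: H_s and H_t are (n_samples, dim) matrices with the same shape.
    The closed form is W_hat = U @ Vt, where U, Vt come from the SVD of H_s.T @ H_t.
    """
    M = H_s.T @ H_t                 # (dim, dim) cross-covariance of the two bases
    U, _, Vt = np.linalg.svd(M)     # singular values are not needed for the solution
    return U @ Vt


# Synthetic example (hypothetical data): the "strong" features are an exact
# rotation of the "weak" features, so the alignment can be recovered exactly.
rng = np.random.default_rng(0)
n, d, k = 256, 16, 8
H_s = rng.standard_normal((n, d))                  # weak-model adapter inputs
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
H_t = H_s @ Q                                      # strong-model adapter inputs
W_s = rng.standard_normal((d, k))                  # weak adapter's transition matrix

W_hat = orthogonal_procrustes(H_s, H_t)
W_t = W_hat.T @ W_s                                # transferred transition matrix

# The transferred layer applied to strong features reproduces the weak layer's output.
print(np.allclose(H_t @ W_t, H_s @ W_s))
```

No gradient step is involved: the transfer costs one matrix product and one SVD of a `dim × dim` matrix, which matches the forward-only, closed-form character described in the caption. With real features the two bases are not exact rotations of each other, so the match would be approximate rather than exact.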