SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

Vishaal Udandarao, Ankush Gupta, Samuel Albanie

TL;DR

This work tackles the challenge of adapting vision-language models without training or target data by introducing SuS-X, a training-free name-only transfer framework. It decomposes the approach into two components: SuS, which constructs a task-specific support set from either Stable Diffusion generation or LAION-5B retrieval using only class names, and TIP-X, a training-free inference method that uses inter-modal text-based signatures and KL-divergence affinities to reweight predictions. Empirically, SuS-X achieves state-of-the-art zero-shot performance on 19 datasets across CLIP, TCL, and BLIP, and TIP-X extends to a few-shot regime to rival or surpass existing training-free baselines. The methods are complementary, data-efficient, and broadly transferable across VLM backbones, offering a practical path for rapid, scalable name-only adaptation with modest computational costs.

Abstract

Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks -- SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.
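To make the "name-only" setting concrete, the sketch below shows the bookkeeping a support set needs when class names are the only input: one text query per support image, plus a one-hot label matrix. The function name `build_support_spec`, the template `"a photo of a {}"`, and the `images_per_class` parameter are illustrative assumptions, not the authors' prompting strategy. The step that turns each query into an image (Stable Diffusion generation or LAION-5B retrieval) is deliberately left out.

```python
import numpy as np


def build_support_spec(class_names, images_per_class=2, template="a photo of a {}"):
    """Return (prompts, one_hot_labels) for a name-only support set.

    The template is a hypothetical placeholder. Each prompt would be passed
    to an image generator or a retrieval index, which is not modelled here.
    """
    prompts, class_ids = [], []
    for idx, name in enumerate(class_names):
        for _ in range(images_per_class):
            prompts.append(template.format(name))
            class_ids.append(idx)
    # One row per support image, one column per class.
    labels = np.eye(len(class_names))[class_ids]
    return prompts, labels


prompts, labels = build_support_spec(["cat", "dog"], images_per_class=2)
```

No image from the target distribution appears anywhere in this step; the labels come for free because each query is derived from a known class name.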

Paper Structure

This paper contains 32 sections, 13 equations, 10 figures, and 22 tables.

Figures (10)

  • Figure 1: Training-free name-only transfer. We propose SuS-X, a framework for enhancing the zero-shot transfer abilities of VLMs like CLIP [radford2021learning], BLIP [li2022blip] and TCL [yang2022visiontcl], without training. To achieve this, we propose a novel method TIP-X, which adapts these VLMs using a curated support set (SuS) that is not drawn from the target distribution. Our SuS leverages one key piece of information about the task at hand: the names of the target categories.
  • Figure 2: SuS-X for training-free name-only transfer. SuS-X consists of two core building blocks. (1) SuS (top right), a dynamic support set that we construct to infuse visual information into the VLM based only on knowledge of target category names. We construct support sets either in a parametric (generating images using Stable Diffusion) or non-parametric (retrieving images from LAION-5B) manner. (2) TIP-X (bottom right), our novel training-free method that leverages image-text distances to compute similarities between the support set and the test images. These similarities act as attention weights for the support set labels, and can directly be combined with the original logits from the VLM for classification.
  • Figure 3: Our two-fold analysis motivating TIP-X
  • Figure 4: (a) Comparison of SuS-X with Zero-shot CLIP. (b) Results of training-free few-shot classification. (c) Performance comparison of SuS-X across visual backbones.
  • Figure 5: Support samples from the generated SuS-SD, retrieved SuS-LC and true training distribution for ImageNet. By randomising the image order in each subfigure, we pose a challenge question: can you match the three images in each subfigure to their source, i.e. SuS-SD, SuS-LC or the ImageNet train set? The answers are provided at the bottom of the page.
  • ...and 5 more figures
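
The TIP-X description in the TL;DR and the Figure 2 caption can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation. It assumes pre-computed, L2-normalised features, and the function name `tipx_style_logits`, the mixing weight `gamma`, the `temperature`, and the softmax normalisation of the affinities are all hypothetical choices made for this example. Each image gets a text-based signature (a softmax over its similarities to the class text features), test and support signatures are compared with KL divergence, and the resulting affinities weight the support labels before being added to the zero-shot logits.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def tipx_style_logits(test_feats, support_feats, support_labels, text_feats,
                      gamma=1.0, temperature=0.5):
    """Simplified TIP-X-style scoring (gamma and temperature are assumptions).

    test_feats:     (t, d) image features of test images
    support_feats:  (n, d) image features of support images
    support_labels: (n, C) one-hot labels of support images
    text_feats:     (C, d) text features of the class names
    """
    zero_shot = test_feats @ text_feats.T                      # (t, C)
    # Inter-modal signatures: distribution over classes for each image.
    test_sig = softmax(zero_shot / temperature)                # (t, C)
    support_sig = softmax(support_feats @ text_feats.T / temperature)  # (n, C)
    # Pairwise KL(test_sig_i || support_sig_j).
    eps = 1e-12
    log_ratio = (np.log(test_sig[:, None, :] + eps)
                 - np.log(support_sig[None, :, :] + eps))
    kl = (test_sig[:, None, :] * log_ratio).sum(axis=-1)       # (t, n)
    # Lower divergence means higher affinity; normalise over the support set.
    weights = softmax(-kl, axis=1)                             # (t, n)
    support_votes = weights @ support_labels                   # (t, C)
    return zero_shot + gamma * support_votes


# Toy data: 3 classes, 8-dim features, two support images per class.
text_feats = np.eye(3, 8)
class_ids = [0, 0, 1, 1, 2, 2]
support_labels = np.eye(3)[class_ids]
support_feats = l2_normalize(text_feats[class_ids] + 0.05)
test_feats = l2_normalize(text_feats[[2, 0]] + 0.1)
logits = tipx_style_logits(test_feats, support_feats, support_labels, text_feats)
print(logits.argmax(axis=1))  # [2 0]
```

Comparing images through their text signatures, rather than through direct image-image similarity, is the design choice the TL;DR refers to as "inter-modal". The softmax over negative KL values is only one way to turn divergences into attention weights and was chosen here for brevity.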