SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

Christoph Timmermann, Hyunse Lee, Woojin Lee

TL;DR

SeMoBridge tackles intra-modal misalignment in CLIP-based few-shot classification by mapping image embeddings into the text modality through a closed-form Semantic Modality Bridge, which replaces unreliable image-to-image comparisons with inter-modal ones. A training-free variant achieves strong performance with minimal overhead, while SeMoBridge-T adds lightweight multi-modal supervision and class-specific biases to further boost accuracy, especially in very low-shot regimes. Across 11 datasets, the method achieves state-of-the-art or competitive results with substantially reduced training time and is robust to distribution shifts. The approach leverages CLIP's inter-modal semantic priors to enable efficient, scalable few-shot adaptation, with code and reproducibility resources provided.

Abstract

While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality, while keeping their semantic content intact through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time while overall outperforming other methods, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
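To make the two kinds of comparison in the abstract concrete, the following is a minimal sketch using made-up 3-dimensional vectors in place of real CLIP embeddings. The array names and values are illustrative assumptions, and the bridge itself is not implemented here; the sketch only shows what an intra-modal (image-to-image) score and an inter-modal (image-to-text) score look like as cosine similarities.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


# Toy stand-ins for embeddings (assumed values, not CLIP outputs).
query_image = np.array([[0.9, 0.1, 0.3]])      # one query image
fewshot_images = np.array([[0.8, 0.2, 0.4],    # support image, class 0
                           [0.1, 0.9, 0.3]])   # support image, class 1
class_texts = np.array([[0.2, 0.1, 0.9],       # text embedding, class 0
                        [0.1, 0.3, 0.8]])      # text embedding, class 1

# Intra-modal: image vs. image, the comparison the abstract calls unreliable.
intra_modal_scores = cosine_similarity(query_image, fewshot_images)

# Inter-modal: image vs. text, the comparison CLIP is trained for.
inter_modal_scores = cosine_similarity(query_image, class_texts)

print(intra_modal_scores.shape, inter_modal_scores.shape)
```

Both score matrices have one row per query and one column per class. The method summarized above aims to turn the first kind of comparison into the second by moving images into the text modality.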

Paper Structure

This paper contains 33 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Comparison of average Accuracy against Training Time of few-shot image classification methods on 11 datasets. Our proposed trained SeMoBridge-T achieves better accuracy using only a fraction of the time.
  • Figure 2: Left: Illustration of the modality gap, intra-modal misalignment, and our proposed Semantic Modality Bridge (SeMoBridge). Due to intra-modal misalignment, query images can be embedded closer to the wrong class. SeMoBridge addresses this by applying a single unified projection that maps image embeddings into the text modality, preserving their semantics and enabling more accurate comparison. Right: Confusion matrices on a subset of 10 classes from the OxfordPets dataset, comparing intra-modal and our bridged inter-modal approach. Each matrix shows how query images are classified with respect to the few-shot support classes. SeMoBridge substantially reduces class confusion by enabling more reliable comparisons.
  • Figure 3: Overall architecture of our method. Left: At inference time, SeMoBridge maps both query and few-shot images into the text modality. The resulting pseudo-EOS tokens are passed through CLIP’s text projection layer, enabling robust inter-modal comparisons. Classification is performed by blending three logits: CLIP's Zero-Shot Prior, Original Few-Shots vs. Bridged Query, and Original Query vs. Bridged Few-Shots. Right: SeMoBridge-T is supervised from both images and texts. Three primary loss terms are used: image alignment, encoded text alignment, and projected text alignment. Only the SeMoBridge parameters are updated, and all CLIP components remain frozen.
  • Figure 4: Few-shot accuracy of training-free SeMoBridge against other methods with ViT-B/16.
  • Figure 5: Few-shot accuracy of trained SeMoBridge-T against other methods with ViT-B/16.
  • ...and 7 more figures
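
The caption of Figure 3 states that classification blends three logits: CLIP's zero-shot prior, original few-shots vs. bridged query, and original query vs. bridged few-shots. A minimal sketch of such a blend is given below, assuming a plain weighted sum. The function names, the weights `weight_a` and `weight_b`, and the toy logit values are hypothetical and are not taken from the paper, which may combine the terms differently.

```python
import numpy as np


def blend_logits(zero_shot, fewshot_vs_bridged_query, query_vs_bridged_fewshot,
                 weight_a=1.0, weight_b=1.0):
    """Weighted sum of three (num_queries, num_classes) logit arrays.

    weight_a and weight_b are hypothetical blending weights, not values
    taken from the paper.
    """
    zero_shot = np.asarray(zero_shot, dtype=float)
    term_a = np.asarray(fewshot_vs_bridged_query, dtype=float)
    term_b = np.asarray(query_vs_bridged_fewshot, dtype=float)
    if not (zero_shot.shape == term_a.shape == term_b.shape):
        raise ValueError("all three logit arrays must share one shape")
    return zero_shot + weight_a * term_a + weight_b * term_b


def predict(blended):
    """Predicted class index for each query row."""
    return np.argmax(blended, axis=-1)


# Toy logits for one query and three classes (assumed values).
zero_shot_logits = [[0.2, 0.1, 0.0]]
logits_a = [[0.0, 0.5, 0.1]]   # original few-shots vs. bridged query
logits_b = [[0.0, 0.4, 0.0]]   # original query vs. bridged few-shots

blended = blend_logits(zero_shot_logits, logits_a, logits_b)
print(predict(blended))
```

In this toy case the zero-shot prior alone favors class 0, while the two bridged terms move the blended prediction to class 1. Setting both weights to zero recovers the zero-shot prediction.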