SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

Christoph Timmermann; Hyunse Lee; Woojin Lee

TL;DR

SeMoBridge tackles intra-modal misalignment in CLIP-based few-shot classification by mapping image embeddings into the text modality through a closed-form Semantic Modality Bridge, so that classification no longer depends on unreliable image-to-image comparisons. A training-free variant achieves strong performance with minimal overhead, while SeMoBridge-T adds lightweight multi-modal supervision and class-specific biases to further boost accuracy, especially in very low-shot regimes. Across 11 datasets, the method achieves state-of-the-art or competitive results with substantially reduced training time, and it is robust to distribution shifts. The approach leverages CLIP's inter-modal semantic priors for efficient, scalable few-shot adaptation; code and reproducibility resources are provided.

Abstract

While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by running computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality while keeping their semantic content intact, through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image- and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time of other methods while outperforming them overall, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
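To make the idea of a closed-form map from image embeddings into the text modality concrete, the following is a minimal sketch on synthetic data. It is not the authors' Semantic Modality Bridge: the abstract does not give the formula, so ridge regression is swapped in here as a generic closed-form stand-in. The names `fit_ridge_bridge`, `bridge`, `lam`, and the toy "modality gap" construction are all assumptions made for illustration, and the random vectors only stand in for CLIP embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fit_ridge_bridge(image_emb, text_targets, lam=1e-2):
    """Closed-form ridge solution W of min ||image_emb @ W - text_targets||^2 + lam ||W||^2.

    Hypothetical stand-in for a closed-form bridge; not the paper's formula.
    """
    dim = image_emb.shape[1]
    gram = image_emb.T @ image_emb + lam * np.eye(dim)
    return np.linalg.solve(gram, image_emb.T @ text_targets)


def bridge(image_emb, W):
    """Map image embeddings into the (toy) text space and re-normalize."""
    return l2_normalize(image_emb @ W)


rng = np.random.default_rng(0)
num_classes, shots, dim = 5, 4, 32

# Synthetic stand-ins for class text embeddings and few-shot labels.
text_emb = l2_normalize(rng.normal(size=(num_classes, dim)))
labels = np.repeat(np.arange(num_classes), shots)

# Toy "modality gap": image embeddings are a distorted, shifted, noisy
# version of their class text embedding (an assumption of this sketch).
distort = rng.normal(size=(dim, dim)) / np.sqrt(dim)
gap = 0.5 * rng.normal(size=dim)
noise = 0.05 * rng.normal(size=(labels.size, dim))
image_emb = l2_normalize(text_emb[labels] @ distort + gap + noise)

# Fit the map on the few-shot set, then classify by inter-modal similarity
# between bridged image embeddings and class text embeddings.
W = fit_ridge_bridge(image_emb, text_emb[labels])
bridged = bridge(image_emb, W)
logits = bridged @ text_emb.T
pred = logits.argmax(axis=1)
accuracy = float((pred == labels).mean())
print(f"few-shot set accuracy after bridging: {accuracy:.2f}")
```

The point of the sketch is only the shape of the computation: a single linear solve replaces per-sample optimization, and classification then uses image-to-text similarity rather than raw image-to-image similarity. How SeMoBridge actually constructs its bridge, and how SeMoBridge-T trains it, is specified in the paper and the linked repository.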
