Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification

Xi Yang, Pai Peng, Wulin Xie, Xiaohuan Lu, Jie Wen

TL;DR

The paper tackles the modality gap between image and text features in vision-language models for few-shot image classification. It proposes Cross-Modal Mapping (CMM), a simple framework that globally aligns image features to the text space via a residual linear transform and locally refines the alignment with a triplet loss, so that text features can serve as class prototypes. Fusing CMM logits with CLIP’s inter-modal logits preserves zero-shot capability while improving accuracy and efficiency, especially in data-scarce settings. On 11 standard datasets, CMM improves average Top-1 accuracy by about $1.06\%$ over methods that partially fine-tune the backbone, and it also performs well on four distribution-shift benchmarks, with a low-complexity training footprint.

Abstract

Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-language models, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets compared to methods that partially fine-tune the backbone, and it performs excellently on 4 distribution shift datasets. Notably, CMM effectively mitigates the modality gap in pre-trained models, enabling text features to serve as effective class prototypes for image features, thus providing an efficient and highly generalizable solution for few-shot learning.
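The global alignment and logit fusion described above can be illustrated with a minimal sketch. The exact formulas are not given in this summary, so the residual form `x + W x`, the additive fusion `logits_clip + alpha * logits_cmm`, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def residual_map(image_feat, W):
    """Assumed residual linear transform: x + W x, then re-normalized."""
    Wx = matvec(W, image_feat)
    return l2_normalize([x + d for x, d in zip(image_feat, Wx)])

def fused_logits(image_feat, W, T, clip_text, alpha):
    """Assumed fusion: CLIP logits plus alpha times CMM logits.

    T holds the learnable text prototypes; clip_text holds the frozen
    CLIP text features (both are hypothetical names).
    """
    x = l2_normalize(image_feat)
    mapped = residual_map(x, W)
    logits_clip = [dot(x, l2_normalize(t)) for t in clip_text]
    logits_cmm = [dot(mapped, l2_normalize(t)) for t in T]
    return [c + alpha * m for c, m in zip(logits_clip, logits_cmm)]

# Toy 2-D example with two classes. W = 0 makes the residual map the
# identity, so both logit terms coincide before any training.
W_zero = [[0.0, 0.0], [0.0, 0.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
image = [0.6, 0.8]
logits = fused_logits(image, W_zero, text, text, alpha=1.0)
predicted = max(range(len(logits)), key=logits.__getitem__)
```

In this toy setup the image feature lies closer to the second text prototype, so `predicted` is class index 1. With a trained `W` and `T`, only the CMM term would change, while the CLIP term stays fixed.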

Paper Structure

This paper contains 20 sections, 18 equations, 14 figures, 9 tables, and 1 algorithm.

Figures (14)

  • Figure 1: t-SNE (van der Maaten and Hinton, 2008) visualization. Triangles represent textual feature embeddings for each category, dots indicate image feature embeddings, and colors distinguish categories.
  • Figure 2: CMM demonstrates competitive performance across 11 datasets, with values reflecting the average top-1 accuracy under 1, 2, 4, 8, and 16-shot conditions compared to other methods.
  • Figure 3: Performance comparison of CMM, TIP (Zhang et al., 2021), and Cross-Modal Linear/Partial (Lin et al., 2023) on ImageNet. CMM outperforms partial fine-tuning approaches.
  • Figure 4: Cross-Modal Mapping (CMM) Architecture. Without the need to construct a visual cache, CMM optimizes only the linear transformation matrix $W$ and the textual feature matrix $T$, simplifying the training process and effectively reducing the modality gap. This reduction in the modality gap narrows the search space for the fusion parameter $\alpha$, thereby enhancing inference efficiency. $\mathrm{Logits_{CLIP}}$ represents CLIP's inter-modal classification, while $\mathrm{Logits_{CMM}}$ introduces a cross-modal inductive bias to further enhance classification capability.
  • Figure 5: Triplet Loss: The anchor is the image feature, the positive is the corresponding textual feature, and the negative is the closest incorrect textual feature.
  • ...and 9 more figures
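
The triplet construction in the Figure 5 caption (image feature as anchor, its class text feature as positive, the closest incorrect text feature as negative) can be sketched as follows. The Euclidean distance, the margin value of 0.2, and the function names are assumptions for illustration; the summary does not state which distance or margin the paper uses.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, text_feats, label, margin=0.2):
    """Hinge triplet loss with the hardest text negative.

    anchor:     image feature (after any mapping into the text space)
    text_feats: one text feature per class
    label:      index of the anchor's true class
    margin:     hypothetical margin value, not taken from the paper
    """
    positive = text_feats[label]
    # Hardest negative: the nearest text feature of a different class.
    negative = min(
        (t for c, t in enumerate(text_feats) if c != label),
        key=lambda t: euclidean(anchor, t),
    )
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

text_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Anchor sits on its own class prototype: the margin is already satisfied.
loss_easy = triplet_loss([1.0, 0.0], text_feats, label=0)
# Anchor sits on a wrong class prototype: the loss is positive.
loss_hard = triplet_loss([0.0, 1.0], text_feats, label=0)
```

An anchor that already matches its text prototype incurs no loss, while an anchor lying on another class's prototype is penalized by its distance to the correct prototype plus the margin. This is the local refinement that complements the global linear mapping.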