
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

Zhenyu Zhang, Yixiong Zou, Yuhua Li, Ruixuan Li, Guangyao Chen

Abstract

Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Prior work on traditional visual models suggests that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we investigate this phenomenon to provide an interpretation and a solution. Through both theoretical analysis and experiments, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the severe modality misalignment in SF-CDFSL. However, we find that the visual learning part essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, thereby hindering cross-modal alignment and harming performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that our approach consistently sets new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
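
For readers unfamiliar with the loss the abstract refers to, the following is a minimal sketch of a CLIP-style cross-entropy over image-text cosine similarities, which is the usual form of $\mathcal{L}_{\mathrm{vlm}}$ in VLM fine-tuning. It is not taken from the authors' code, and it does not reproduce the paper's decomposition into visual and cross-modal parts. The function names, the toy features, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

def vlm_cross_entropy(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy of one image against one text feature per class.

    temperature=0.07 is an assumed value, not one reported in the source.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_feats]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Toy example: two classes, one image feature close to the class-0 text feature.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss_correct = vlm_cross_entropy(image, texts, label=0)
loss_wrong = vlm_cross_entropy(image, texts, label=1)
print(loss_correct < loss_wrong)
```

In this toy setting the loss is small when the image feature aligns with the text feature of its own class and large otherwise, which is the cross-modal behaviour the abstract says fine-tuning should restore.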

Paper Structure (41 sections, 18 equations, 13 figures, 7 tables, 1 algorithm)

Figures (13)

  • Figure 1: (a) Unlike traditional visual models, we find that enhancing visual learning consistently decreases the performance of the VLM-based CDFSL, although the visual-modal accuracy increases. Appropriately inhibiting visual learning improves cross-modal performance. (b) We also find that this phenomenon widely exists in cross-domain scenarios, where the best-performing zero-shot CLIP model does not necessarily extract the best visual features. (c) In this paper, we explore the phenomenon for an interpretation, finding that fine-tuning involves two learning directions: the visual and the cross-modal ones. We interpret visual learning as a shortcut that reduces the VLM's classification loss but disrupts cross-modal learning. (d) Based on this, we propose a method to suppress and guide visual learning, enhancing the model's cross-modal learning.
  • Figure 2: (a) When sample $i$ and sample $k$ belong to different classes, $\Delta \cos(\theta_{ik})$ in 5-way 1-shot fine-tuning is always less than 0. (b) At any stage of fine-tuning (each epoch), visual learning can effectively reduce the loss value ($\mathcal{L}_{\mathrm{vlm}}$).
  • Figure 3: Recording the model at each epoch during training and further training these models using visual learning, the target loss value $\mathcal{L}_{\mathrm{vlm}}$ can be effectively reduced even further.
  • Figure 4: We employ three strategies to restrict visual learning during the VLM fine-tuning process and measure the distance of the resulting models from the optimal cross-modal model. We find that visual learning increases the distance from the optimal model, which hinders visual-text alignment.
  • Figure 5: (a) Gap shift operation (Liang et al., 2022). (b) The model's initial state is marked by the vertical dashed line. In the cross-domain dataset EuroSAT, CLIP's modalities are misaligned, meaning that through gap shift, lower loss and higher accuracy can be achieved. (c) Fine-tuning does not effectively re-align the modalities. (d) When visual learning is enhanced ($\mathcal{L}_{\mathrm{v}}$), the misalignment becomes more pronounced. (e) When visual learning is suppressed ($\mathcal{L}_{\mathrm{ad}}$), the misalignment is mitigated.
  • ...and 8 more figures
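
The gap shift operation mentioned in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the operation as it is commonly described for modality-gap analysis, not the authors' implementation: the gap vector is taken as the difference between the mean image feature and the mean text feature, each modality is moved along it by a factor `lam`, and the results are re-normalized. The sign convention, the name `lam`, and the toy features are assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def gap_shift(image_feats, text_feats, lam):
    """Shift both modalities along the gap vector, then re-normalize.

    Assumed convention: gap = mean(image) - mean(text); images move by
    -lam * gap and texts by +lam * gap, so a positive lam narrows the gap.
    """
    gap = [a - b for a, b in zip(mean_vector(image_feats), mean_vector(text_feats))]
    shifted_images = [l2_normalize([x - lam * g for x, g in zip(v, gap)])
                      for v in image_feats]
    shifted_texts = [l2_normalize([y + lam * g for y, g in zip(t, gap)])
                     for t in text_feats]
    return shifted_images, shifted_texts

def center_distance(a_feats, b_feats):
    """Euclidean distance between the two modality centers."""
    diff = [a - b for a, b in zip(mean_vector(a_feats), mean_vector(b_feats))]
    return math.sqrt(sum(d * d for d in diff))

# Toy 2-D unit features with a visible gap between the modalities.
images = [[1.0, 0.0], [0.8, 0.6]]
texts = [[0.0, 1.0], [0.6, 0.8]]
gap_before = center_distance(images, texts)
shifted_images, shifted_texts = gap_shift(images, texts, lam=0.5)
gap_after = center_distance(shifted_images, shifted_texts)
print(gap_after < gap_before)
```

Sweeping `lam` over a range and recording loss and accuracy at each value would give curves of the kind the caption describes, with `lam = 0` corresponding to the model's unshifted state.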