Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li

TL;DR

Certain middle layers of CLIP's text encoder (the "lost layers") carry information that goes underused in SF-CDFSL. The proposed method teaches the model to re-utilize the information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts.

Abstract

Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate that CLIP's text encoder is more suitable for cross-domain tasks. However, we find that removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL; we call these the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that the information in these layers is not harmful to the SF-CDFSL task but actually beneficial; visual gaps prevent this useful information from being fully utilized, which makes these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method that teaches the model to re-utilize the information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
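
To make the layer-removal experiment concrete, the following is a minimal sketch on a toy residual stack, not the authors' code. It assumes that "removing" a layer means skipping its residual update, and that "emphasizing" a layer (the second strategy named in the Figure 2 caption) means adding that layer's output back into the final output with a weight. The names run_encoder, make_layer, and alpha are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Toy layer (hypothetical): its residual update is the input times `scale`."""
    return lambda hidden: [scale * value for value in hidden]


def run_encoder(
    x: Sequence[float],
    layers: Sequence[Layer],
    remove: Optional[int] = None,
    emphasize: Optional[int] = None,
    alpha: float = 0.5,
) -> Vector:
    """Run a residual stack, optionally skipping or emphasizing one layer.

    remove:    index of a layer to skip (the residual stream passes through unchanged).
    emphasize: index of a layer whose output is added to the final output, scaled by alpha.
    Both behaviours are assumptions made for this sketch.
    """
    hidden = list(x)
    kept: Optional[Vector] = None
    for index, layer in enumerate(layers):
        if index == remove:
            continue  # Remove: drop this layer's contribution entirely
        update = layer(hidden)
        hidden = [h + u for h, u in zip(hidden, update)]
        if index == emphasize:
            kept = list(hidden)  # remember this layer's output for later
    if kept is not None:
        # Emphasize: residual connection from the chosen layer to the final output
        hidden = [h + alpha * k for h, k in zip(hidden, kept)]
    return hidden


if __name__ == "__main__":
    toy_layers = [make_layer(1.0) for _ in range(3)]
    print("full     :", run_encoder([1.0, 2.0], toy_layers))
    print("remove 1 :", run_encoder([1.0, 2.0], toy_layers, remove=1))
    print("emphasize:", run_encoder([1.0, 2.0], toy_layers, emphasize=1))
```

Sweeping remove over every layer index and comparing downstream accuracy against the full stack would mirror the per-layer masking curve described for Figure 1(b), although the real experiment uses a fine-tuned CLIP text encoder rather than this toy stack.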

Paper Structure (47 sections, 20 equations, 10 figures, 13 tables)

This paper contains 47 sections, 20 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: (a) CLIP has two branches: a visual encoder and a text encoder. However, we find that removing certain layers of the text encoder can significantly enhance its performance in SF-CDFSL tasks. (b) Performance of 5-way 1-shot fine-tuned CLIP after removing the i-th layer (x-axis) of the text encoder. The horizontal dashed line represents the performance achieved using the full text encoder. Masking certain layers results in better performance. (c) After applying our method, the optimal performance is achieved using the full text encoder (dashed line), indicating that the lost layer no longer exists.
  • Figure 2: In SF-CDFSL tasks, (a) the lost layer is commonly present in various CLIP structures (Improvement: increased performance achieved by masking a specific layer of the text encoder), and (b) different fine-tuning methods do not effectively address this issue. Please refer to the Appendix for more detailed results. (c) Two strategies for leveraging the lost layer: Remove - eliminating the layer; Emphasize - enhancing the output of the layer using a residual approach in the final output. (d) In the source domain (ImageNet), the performance using the full text encoder (blue dashed line) is consistently optimal, thus no lost layer exists. However, after a change in the visual domain (ImageNet-R), masking the 7th layer of the text encoder significantly improves performance, indicating the reappearance of the lost layer.
  • Figure 3: (a) The overall architecture of the VtT model. First, the V-T Fusion module integrates visual and textual features at the layer level (yellow lines). Then, the TIA module absorbs information from the text encoder at the encoder level (pink lines). Finally, the DGSO module optimizes the model using the outputs from the previous modules and gradient information (orange lines). (b) The V-T Fusion module interleaves the outputs of the visual and text encoders from deep to shallow layers and integrates them using the SSM network. (c) The DGSO module removes gradients that conflict with the main task (classification) before optimizing the model. It also determines when to stop using the VtT model based on the extent of gradient conflicts.
  • Figure 4: The 5-way 5-shot results on the 10 Meta-dataset datasets [triantafillou2019meta]; see the Appendix for the 5-way 1-shot and detailed results.
  • Figure 5: The attention maps. From left to right: the original image, the baseline result, the baseline + remove result (see Figure 2), and the result of our method. Black boxes highlight areas of incorrect attention, while white boxes highlight missing attention. $Sim$ represents the cosine similarity between the image features and the text features. A higher similarity indicates better alignment.
  • ...and 5 more figures
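
The Figure 3 caption says that the DGSO module removes gradients that conflict with the main classification task and uses the extent of conflict to decide when to stop. The captions do not give the exact rule, so the sketch below substitutes a common projection-style heuristic: an auxiliary gradient is treated as conflicting when its inner product with the main-task gradient is negative, and the conflicting component is projected out. The function names and the use of a conflict fraction as a stopping signal are assumptions for illustration, not the paper's formulation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflict(aux_grad: Vector, main_grad: Vector) -> List[float]:
    """Project out the part of aux_grad that opposes main_grad.

    If the two gradients do not conflict (non-negative inner product),
    or main_grad is zero, aux_grad is returned unchanged.
    """
    inner = dot(aux_grad, main_grad)
    norm_sq = dot(main_grad, main_grad)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(aux_grad)
    scale = inner / norm_sq
    return [a - scale * m for a, m in zip(aux_grad, main_grad)]


def conflict_fraction(aux_grads: Sequence[Vector], main_grad: Vector) -> float:
    """Fraction of auxiliary gradients that conflict with main_grad.

    A hypothetical stopping signal: a training loop could disable the
    auxiliary objective once this fraction exceeds a chosen threshold.
    """
    if not aux_grads:
        return 0.0
    conflicts = sum(1 for grad in aux_grads if dot(grad, main_grad) < 0.0)
    return conflicts / len(aux_grads)


if __name__ == "__main__":
    main = [1.0, 0.0]
    aux = [-1.0, 1.0]
    cleaned = remove_conflict(aux, main)
    print("cleaned gradient:", cleaned)
    print("inner product after projection:", dot(cleaned, main))
    print("conflict fraction:", conflict_fraction([aux, [1.0, 1.0]], main))
```

After projection the cleaned gradient is orthogonal to the main-task gradient, so applying it can no longer decrease alignment with the classification objective to first order.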