Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting
Linhai Zhuo, Zheng Wang, Yuqian Fu, Tianwen Qian
TL;DR
This work tackles source-free cross-domain few-shot learning (CD-FSL) by introducing SeGD-VPT, a semantic-guided diversity visual prompt tuning framework built on CLIP. It combines three pillars: visual diversity prompts added at the input, textual describe prompts guided by semantic descriptions, and cross-modal alignment via targeted contrastive losses, all while keeping the CLIP backbone frozen. The approach is trained in two steps: the first generates diverse, semantically grounded prompt features, and the second trains a classifier on these features using ArcFace loss. It yields strong performance on BSCD benchmarks and establishes new state-of-the-art results under source-free CD-FSL. These results demonstrate the value of leveraging the textual modality to enrich input diversity and improve the transferability of large-scale pretrained models to diverse target domains with minimal data, reducing the need for source-domain data in practical deployments.
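For readers unfamiliar with the ArcFace loss mentioned for the second step, the following is a minimal sketch of its standard form: an additive angular margin is applied to the target-class angle before scaling and cross-entropy. The function names, the `scale` and `margin` values, and the toy inputs are illustrative assumptions, not the settings used in this work.

```python
import math


def arcface_logits(feature, class_weights, target, scale=30.0, margin=0.5):
    """Scaled cosine logits with an additive angular margin on the target class.

    `scale` and `margin` are assumed illustrative values, not the paper's.
    This simple form omits the special handling usually applied when
    theta + margin exceeds pi.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    f = normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(f, normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


# Toy 2-D feature and two class prototypes (hypothetical values).
feature = [1.0, 0.2]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
plain = arcface_logits(feature, class_weights, target=0, margin=0.0)
margined = arcface_logits(feature, class_weights, target=0, margin=0.5)
loss = cross_entropy(margined, 0)
```

The margin lowers only the target-class logit, so the classifier must separate classes by a larger angle to reach the same loss.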
Abstract
The source-free cross-domain few-shot learning (CD-FSL) task aims to transfer pretrained models to target domains using only a few samples, without access to source-domain data. Addressing this task requires models with robust generalization and strong feature representation, which matches the characteristics of large-scale pretrained models. However, large-scale models tend to lose representational ability in cross-domain scenarios because of limited sample diversity. Given the abundant diversity provided by the semantic modality, this paper leverages the textual modality to enhance training sample diversity using the CLIP model, while also improving model transfer efficiency. Specifically, we propose the SeGD-VPT framework, which is divided into two phases. The first phase aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varied inputs and enhancing sample diversity. Furthermore, we use diversity descriptions of classes to guide semantically meaningful learning of the diversity prompts, and we propose random combination and selection of texts to increase textual diversity. Additionally, deep prompt tuning is introduced to enhance the model's transfer capability. After the first phase is trained, support samples with different diversity prompts are fed into the CLIP backbone to generate enhanced features. The second phase then trains classifiers on these generated features. Extensive experimental results across several benchmarks verify that our method is comparable to SOTA source-utilized models and attains the best performance under the source-free CD-FSL setting.
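The random combination and selection of texts can be pictured with a minimal sketch: for each training step, a few class descriptions are drawn at random and joined into one text prompt. The template string, the helper name `sample_text_prompt`, and the example descriptions are hypothetical; the abstract does not specify how the texts are combined.

```python
import random


def sample_text_prompt(class_name, descriptions, k, rng):
    """Pick k distinct descriptions at random and join them into one prompt.

    The "a photo of a ..." template is an assumed placeholder.
    """
    chosen = rng.sample(descriptions, k)
    return f"a photo of a {class_name}, " + ", ".join(chosen)


# Hypothetical class descriptions, for illustration only.
descriptions = [
    "with a long tail",
    "with striped fur",
    "sitting on grass",
    "seen from the side",
]

rng = random.Random(0)  # seeded so repeated runs draw the same prompts
prompts = [sample_text_prompt("cat", descriptions, 2, rng) for _ in range(3)]
for p in prompts:
    print(p)
```

Each call can yield a different text for the same class, which is one simple way to obtain varied textual targets from a fixed pool of descriptions.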
