CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels
Shizhuo Deng, Bowen Han, Jiaqi Chen, Hao Wang, Dongyue Chen, Tong Jia
TL;DR
This work tackles noisy labels in CLIP-based few-shot learning by introducing CRoF, a plug-in framework that combines a task-oriented prompt generator, a CLIP model fine-tuned with a residual adapter, and a top-$K$ soft-label weighting mechanism. The task-oriented prompts enlarge inter-class distances between textual embeddings, while the weighting scheme balances the original label against CLIP-produced labels under label noise, guided by hyperparameters and similarity rankings. Across diverse datasets and noise conditions, CRoF consistently improves robustness and generalization over standard CLIP-based fine-tuning methods, often with substantial gains in high-noise scenarios. The approach is a practical, modular enhancement for real-world FSL where label quality varies: it uses CLIP priors and multi-label soft supervision to reduce misclassification caused by noisy data.
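To make the notion of "inter-class distance" between textual embeddings concrete, the following is a minimal sketch that scores a set of class embeddings by their mean pairwise cosine distance. The function names and the toy 2-D vectors are hypothetical and stand in for real CLIP text features; the metric actually used by the authors may differ.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_interclass_distance(class_embeddings):
    """Average cosine distance (1 - similarity) over all pairs of classes.

    Assumption: one embedding per class; a larger value means the class
    prompts are easier to tell apart in embedding space.
    """
    pairs = list(combinations(class_embeddings, 2))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for text embeddings (not real CLIP outputs).
generic_prompts = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1]]        # nearly collinear
descriptive_prompts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # well separated

generic_score = mean_interclass_distance(generic_prompts)
descriptive_score = mean_interclass_distance(descriptive_prompts)
```

Under this illustrative metric, a prompt set whose class embeddings are spread further apart receives the higher score, which is the property the task-oriented prompts are described as improving.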
Abstract
Noisy labels threaten the robustness of few-shot learning (FSL) because the features learned in a new domain are inexact. CLIP, a large-scale vision-language model, performs well in FSL by relying on image-text embedding similarities, but it is susceptible to misclassification caused by noisy labels. Enhancing the domain generalization of CLIP on noisy data within FSL tasks is therefore a critical challenge. In this paper, we offer a new perspective on mitigating the influence of noisy labels: CLIP-based Robust Few-shot learning (CRoF). CRoF is a general plug-in module for CLIP-based models. To avoid misclassification and confused label embeddings, we design a few-shot task-oriented prompt generator that gives a more discriminative description of each category. The proposed prompts yield larger inter-class distances between textual embeddings. Furthermore, rather than fully trusting CLIP's zero-shot classification, we fine-tune CLIP on noisy few-shot data in the new domain with a weighting strategy similar to label smoothing. The weights assigned to multiple potentially correct labels account for the relationship between CLIP's prior knowledge and the original label information to ensure reliability. Our multiple-label loss function further supports robust training under this paradigm. Comprehensive experiments show that CRoF, as a plug-in, outperforms fine-tuned and vanilla CLIP models across different noise types and noise ratios.
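The weighting idea described above can be illustrated with a small sketch: build a soft target from CLIP's top-K most similar classes, blend it with the given (possibly noisy) label, and train against it with a soft cross-entropy. This is only a sketch under stated assumptions. The blending coefficient `alpha`, the softmax over the top-K similarities, and all function names are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_soft_label(similarities, given_label, k=3, alpha=0.5):
    """Blend the given label with CLIP's top-k candidate classes.

    similarities: image-text similarity per class (CLIP's prior).
    given_label:  index of the possibly noisy annotated class.
    alpha:        hypothetical trust placed in the given label (0..1).
    """
    num_classes = len(similarities)
    # Rank classes by similarity and keep the k most similar ones.
    top_k = sorted(range(num_classes),
                   key=lambda c: similarities[c], reverse=True)[:k]
    clip_probs = softmax([similarities[c] for c in top_k])

    target = [0.0] * num_classes
    for c, p in zip(top_k, clip_probs):
        target[c] += (1.0 - alpha) * p   # mass from CLIP's prior
    target[given_label] += alpha         # mass from the original label
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))


# Toy case: CLIP favors class 0, but the sample is annotated as class 3.
sims = [0.9, 0.2, 0.1, 0.05]
target = top_k_soft_label(sims, given_label=3, k=2, alpha=0.3)
loss = soft_cross_entropy([2.0, 0.5, 0.1, 0.3], target)
```

In this toy case the target keeps some mass on the annotated class while shifting most of it to the classes CLIP ranks highest, so a single wrong label does not fully dictate the training signal. With `alpha=1.0` the target reduces to the ordinary one-hot label.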
