Noise-Tolerant Few-Shot Unsupervised Adapter for Vision-Language Models

Eman Ali, Muhammad Haris Khan

TL;DR

NtUA tackles unsupervised adaptation of vision-language models to target domains with limited unlabeled data by building a weighted key-value cache of CLIP features and pseudo-labels, where the weights reflect pseudo-label confidence. It introduces two complementary designs, adaptive cache formation and knowledge-guided cache refinement; the latter uses CLIP-distilled predictions from a larger model (ViT-L/14) to update cache values and weights, complemented by a prototype-affinity loss to emphasize reliable signals. Across 11 datasets in a 16-shot setting, NtUA consistently outperforms zero-shot CLIP and several unsupervised baselines while remaining computationally efficient. The approach targets scalable, noise-tolerant adaptation of vision-language models in real-world tasks where labeled data is hard to obtain, in domains such as medical imaging and specialized translation.

Abstract

Recent advances in large-scale vision-language models have achieved impressive performance in various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability and generalizability when handling various visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that allows the learning of effective target models with few unlabelled target samples. NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation, which combats pseudo-label noise by weighting the key-value pairs according to their prediction confidence. The second is knowledge-guided cache refinement, which refines pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision-language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
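
The weighted key-value cache described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes L2-normalized features, one-hot pseudo-labels taken from the argmax of a softmax over image-text similarities, and an exponential affinity of the kind used in Tip-Adapter-style caches. The function names and the `temperature` and `beta` parameters are hypothetical, chosen only for this example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def build_weighted_cache(image_feats, text_feats, temperature=100.0):
    """Build (keys, values, weights) from a few unlabelled samples.

    keys    : normalized image features
    values  : one-hot pseudo-labels predicted from image-text similarity
    weights : confidence (max softmax probability) of each pseudo-label
    `temperature` is an assumed logit scale, not a value from the paper.
    """
    text_feats = [l2_normalize(t) for t in text_feats]
    keys, values, weights = [], [], []
    for feat in image_feats:
        key = l2_normalize(feat)
        probs = softmax([temperature * dot(key, t) for t in text_feats])
        label = max(range(len(probs)), key=probs.__getitem__)
        keys.append(key)
        values.append([1.0 if c == label else 0.0 for c in range(len(probs))])
        weights.append(probs[label])
    return keys, values, weights

def cache_logits(query, keys, values, weights, beta=5.5):
    """Confidence-weighted cache prediction for one query feature.

    The affinity exp(-beta * (1 - cosine)) is an assumed choice; each
    cache entry's vote is scaled by its pseudo-label confidence.
    """
    q = l2_normalize(query)
    out = [0.0] * len(values[0])
    for key, value, w in zip(keys, values, weights):
        affinity = math.exp(-beta * (1.0 - dot(q, key)))
        for c, v in enumerate(value):
            out[c] += w * affinity * v
    return out

# Toy example: two classes, three unlabelled samples in a 2-D feature space.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
keys, values, weights = build_weighted_cache(image_feats, text_feats, temperature=10.0)
logits = cache_logits([1.0, 0.0], keys, values, weights)
```

In the toy example the third sample lies close to the decision boundary, so it receives a lower weight and contributes less to the cache prediction than the first sample. That down-weighting of uncertain entries is the intended noise-tolerance effect.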

Paper Structure

This paper contains 9 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A comparison of pseudo-label accuracy across different shots within the DTD dataset illustrates that accuracy tends to increase as the number of samples grows.
  • Figure 2: Unlike the key-value caches built from labelled samples in supervised methods [zhang2022tip, orhan2018simple], we build a weighted key-value cache from unlabelled samples, where the cache weights are determined by the confidence of the pseudo-labels predicted by large-scale vision-language models. The adaptive weighting mechanism makes the unsupervised adaptation more tolerant to noisy pseudo-labels.
  • Figure 3: The framework of the Noise-Tolerant Unsupervised Adapter (NtUA). (a) In Stage I, NtUA first conducts adaptive cache formation by constructing a weighted key-value cache to store the knowledge of few-shot unlabelled target samples, and then applies knowledge-guided cache refinement to rectify both cache values and cache weights. In the cache, the image features extracted with CLIP's visual encoder $E_{v}$ serve as the keys, and the CLIP-predicted pseudo-labels (generated using $E_{v}$ and CLIP's textual encoder $E_{t}$) serve as the values. The corresponding pseudo-label prediction confidence serves as the weights of the key-value pairs. To perform knowledge-guided cache refinement, NtUA generates CLIP-distilled predictions (with CLIP's visual encoder $E_{v}^{kd}$ and textual encoder $E_{t}$) and leverages this CLIP-distilled knowledge to update both values and weights in the cache. (b) In Stage II, NtUA updates the keys in the constructed weighted key-value cache by incorporating knowledge from both the cache and CLIP.
  • Figure 4: The distribution of correct and incorrect pseudo-labels based on (a) the confidence generated from ViT-B/32, (b) the confidence generated from ViT-L/14, and (c) the prototype-affinity weights $\omega$ generated from ViT-L/14. All experiments are done on a 16-shot sample from the Caltech101 training set.
  • Figure 5: Comparison of training times: NtUA versus five baseline methods using 16 unlabelled samples from the ImageNet dataset.
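
The knowledge-guided cache refinement step in the Figure 3 caption (updating cache values and weights from CLIP-distilled predictions while leaving the keys untouched) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the refined value is taken to be the one-hot argmax of the stronger model's prediction and the refined weight its maximum softmax probability. The `refine_cache` function, the dictionary layout, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refine_cache(cache, teacher_logits):
    """Return a new cache whose values and weights come from a stronger model.

    cache          : dict with "keys", "values", "weights" (one entry per sample)
    teacher_logits : per-sample class logits from the larger model (assumed given)
    Keys are copied unchanged; only values and weights are overwritten.
    """
    new_values, new_weights = [], []
    for logits in teacher_logits:
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        new_values.append([1.0 if c == label else 0.0 for c in range(len(probs))])
        new_weights.append(probs[label])
    return {"keys": [list(k) for k in cache["keys"]],
            "values": new_values,
            "weights": new_weights}

# Toy example: the second sample was pseudo-labelled as class 0 with low
# confidence; the stronger model assigns it to class 1 instead.
cache = {
    "keys": [[1.0, 0.0], [0.0, 1.0]],
    "values": [[1.0, 0.0], [1.0, 0.0]],
    "weights": [0.90, 0.55],
}
teacher_logits = [[4.0, 0.0], [0.0, 2.0]]
refined = refine_cache(cache, teacher_logits)
```

Here the refinement both corrects the second pseudo-label and raises its weight, while the original cache object is left unmodified so the two versions can be compared.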