CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal

TL;DR

NoLA addresses CLIP's weaker performance on fine-grained tasks by combining LLM-derived class descriptions, a DINO-based pseudo-labeling mechanism, and prompt tuning of CLIP's vision encoder. The method auto-labels unlabeled image collections using a class description-based embedding (CDE) classifier and a DINO-aligned labeling network, then uses DINO supervision to perform lightweight prompt tuning on CLIP's vision branch. Across 11 diverse datasets, NoLA achieves an average absolute gain of 3.6% over the previous state-of-the-art LaFTer in a label-free setting and attains state-of-the-art results on 9 of the 11 datasets. By leveraging unlabeled data and the complementary strengths of LLMs, SSL backbones, and prompt learning, NoLA offers a way to adapt foundation models to fine-grained recognition without costly annotations.

Abstract

In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet these SSL models require an additional supervised linear probing step that relies on fully labeled data, which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to substantially enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification than CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFTer across 11 diverse image classification datasets. Our code and models are available at https://github.com/fazliimam/NoLA.
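Step (1) can be illustrated with a minimal NumPy sketch. It assumes the per-class description embeddings have already been computed by a text encoder and, following the Figure 2 caption, averages them into one classifier row per class. The function names (`l2_normalize`, `build_cde_classifier`, `zero_shot_predict`), the normalization choices, and the toy 3-dimensional vectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_cde_classifier(description_embeddings):
    """Average each class's description embeddings into one classifier row.

    description_embeddings: list with one (n_descriptions, dim) array per class.
    Returns an array of shape (n_classes, dim) with unit-norm rows.
    Normalizing before and after the mean is an assumption of this sketch.
    """
    class_vectors = [l2_normalize(e).mean(axis=0) for e in description_embeddings]
    return l2_normalize(np.stack(class_vectors))


def zero_shot_predict(image_features, classifier):
    """Cosine-similarity logits between image features and class embeddings."""
    logits = l2_normalize(image_features) @ classifier.T
    return logits.argmax(axis=1), logits


# Toy stand-ins for text-encoder outputs: two classes, 3-dimensional embeddings.
descriptions = [
    np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),                    # class 0
    np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),   # class 1
]
# Toy stand-ins for vision-encoder outputs for two images.
images = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]])

cde = build_cde_classifier(descriptions)
predictions, logits = zero_shot_predict(images, cde)
print(predictions)
```

In a real pipeline the toy arrays would be replaced by embeddings from CLIP's text and vision encoders; the averaging and similarity steps stay the same.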

Paper Structure

This paper contains 28 sections, 5 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Top-1 accuracy (%) comparison with recent label-free methods on 11 diverse image classification datasets. NoLA (Ours) achieves state-of-the-art performance on 9 out of 11 datasets, outperforming the state-of-the-art LaFTer by an average absolute gain of 3.6%.
  • Figure 2: Overview of the proposed NoLA (No Labels Attached) method. (a) A set of templates and the class names are fed through an LLM to generate context-enriched text descriptions per class. The description embeddings obtained from the CLIP text encoder $f_{t}$ are averaged to compose the class description-based embedding (CDE) classifier $\phi$. (b) Zero-shot predictions are obtained for the training set using the CLIP vision encoder $f_{v}$ and the CDE classifier. From these predictions, the top-k most confident training samples are selected to train the alignment module $h$, which is built on a self-supervised learning (SSL) backbone $g_{s}$ (DINO). (c) The DINO-based labelling network, which contains the alignment module $h$, is then used to generate pseudo-labels and learn dataset-specific visual prompts, which are prepended to the frozen CLIP vision encoder.
  • Figure 3: Comparison of t-SNE projections of EuroSAT embeddings obtained from the base CLIP ViT-B/32 (left) and from CLIP ViT-B/32 adapted with our NoLA framework (right).
  • Figure 4: Top-1 accuracy of the trained DINO-based labelling (DL) network for different values of $k$. The top row shows results on large datasets, namely ImageNet (top-left) and CIFAR-100 (top-right). The bottom row shows results on small datasets, namely UCF101 (bottom-left) and Caltech101 (bottom-right). The value in parentheses on the x-axis is the number of pseudo-labels selected for the specified percentage.
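
The top-k selection mentioned in the captions of Figures 2 and 4 can be sketched as follows. This is a hedged illustration rather than the authors' procedure: it assumes confidence is the maximum softmax probability of the zero-shot logits and that the top $k$ percent is taken over the whole training set. The captions do not say whether selection is global or per class, and the helper names `softmax` and `select_top_k_percent` are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def select_top_k_percent(logits, k_percent):
    """Keep the k_percent most confident samples and their pseudo-labels.

    Assumption: confidence is the maximum softmax probability, and the
    selection is global over all samples rather than per class.
    Returns (indices of kept samples, pseudo-labels of kept samples).
    """
    probs = softmax(logits)
    pseudo_labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    n_keep = max(1, int(round(len(logits) * k_percent / 100.0)))
    # Sort by descending confidence; "stable" keeps ties in original order.
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    return keep, pseudo_labels[keep]


# Toy zero-shot logits for five unlabeled images and three classes.
toy_logits = np.array([
    [4.0, 0.0, 0.0],   # confident, class 0
    [0.1, 0.0, 0.0],   # uncertain
    [0.0, 3.0, 0.0],   # confident, class 1
    [0.0, 0.2, 0.1],   # uncertain
    [0.0, 0.0, 5.0],   # confident, class 2
])

kept_indices, kept_labels = select_top_k_percent(toy_logits, k_percent=60)
print(kept_indices, kept_labels)
```

With five samples and `k_percent=60`, three samples are kept, which mirrors the x-axis annotation in Figure 4 where each percentage is paired with the resulting number of pseudo-labels. The kept pairs would then serve as training targets for the alignment module $h$.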