Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

Dongseob Kim, Hyunjung Shim

TL;DR

This work tackles unsupervised multi-label image classification by addressing two CLIP limitations: view-dependency and bias. It introduces CCD, which first derives global pseudo-labels from CLIP, then refines them with CAM-guided local views and debiasing, followed by a consistency-regularized training regime. Empirically, CCD achieves state-of-the-art results among unsupervised methods and competes with fully supervised approaches on VOC datasets, while revealing sensitivity to dataset scale and prompting cues. The approach offers practical impact by enabling high-performance multi-label understanding without manual annotations, leveraging region-focused CLIP in a calibrated distillation framework.

Abstract

Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.

Paper Structure

This paper contains 23 sections, 8 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: CLIP prediction probabilities for horse (H) and person (P) given different input patches from a single image: (a) The original image and its CLIP prediction. (b) A $1/4$-sized patch and its CLIP prediction. (c) A $1/9$-sized patch and its CLIP prediction. (d) A $1/16$-sized patch and its CLIP prediction. Boxes of the same color mark the region from which each patch was cropped.
  • Figure 2: Sample images showcasing CLIP bias. C indicates the probability of chair. (a) The top-1 probability of a "tv monitor" image is 35%. (b) The top-1 probability of a "person" image is 50%. (c) The top-1 probability of a "horse" image is 100%. (d) The mean class-wise probability on PASCAL VOC 2012. These results reveal the class-wise prediction bias of CLIP.
  • Figure 3: A proof-of-concept study of local view selection methods. We train the classifier with pseudo-labels generated from four different local views: (a) Around GT boxes, (b) GT boxes, (c) Random boxes, (d) Uniform grid boxes. The number below each sample is the performance (mAP) of the classifier trained with the corresponding pseudo-labels for the entire training set. The classifier trained with local views around GT boxes achieved 1.3%p higher performance than the classifier trained with uniform-grid local views.
  • Figure 4: An overview of label preparation. (a) We calculate the cosine similarity between text embeddings and image embeddings. The softmax probability of these similarities is the CLIP prediction for each image. The top-1 probabilities are highlighted in green boxes. The CLIP bias (yellow box) is derived by class-wise averaging of the top-1 probabilities. Pseudo-labels are then generated by debiasing the CLIP predictions. (b) For label updating, classes above a threshold are selected from the classifier output, and local views corresponding to these classes are extracted. Local labels for each patch are acquired in the same way as the initial labels. The final pseudo-label is a weighted sum of the initial pseudo-labels and the local labels.
  • Figure 5: The training process of our method. During the warm-up phase, we train the classifier with a cross-entropy loss targeting the initial pseudo-labels (green line). After the classifier-guided label update, we train the classifier with a cross-entropy loss targeting the updated pseudo-labels (green line) and a cross-entropy loss between the logits of differently augmented inputs (blue dashed line).
  • ...and 6 more figures
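
The label preparation described in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the caption does not give the exact debiasing rule, so dividing each probability by its class bias and capping at 1 is an assumed choice, as are the softmax temperature, the neutral bias of 1 for classes that are never top-1, and the mixing weight. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over image-text cosine similarities (temperature is an assumption)."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def class_bias(all_probs):
    """Class-wise average of top-1 probabilities over a set of images."""
    num_classes = len(all_probs[0])
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for probs in all_probs:
        top = max(range(num_classes), key=probs.__getitem__)
        sums[top] += probs[top]
        counts[top] += 1
    # Assumption: a class that is never top-1 gets a neutral bias of 1.
    return [sums[c] / counts[c] if counts[c] else 1.0 for c in range(num_classes)]


def debias(probs, bias):
    """Assumed debiasing rule: divide by the class bias and cap at 1."""
    return [min(1.0, p / b) for p, b in zip(probs, bias)]


def update_label(initial, local, weight=0.5):
    """Weighted sum of initial pseudo-labels and local labels (weight is an assumption)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(initial, local)]
```

In use, `clip_probs` would be applied to every training image, `class_bias` to the resulting list of predictions, and `debias` to each prediction to obtain the initial pseudo-labels. The same steps applied to the CAM-selected local views would give the local labels that `update_label` mixes in.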