Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
Dongseob Kim, Hyunjung Shim
TL;DR
This work tackles unsupervised multi-label image classification by addressing two limitations of CLIP: view-dependent predictions and inherent bias. It introduces CCD, which first derives global pseudo-labels from CLIP, then refines them using local views selected with the classifier's Class Activation Mapping (CAM) and a debiasing step, and trains the classifier with consistency regularization. Empirically, CCD achieves state-of-the-art results among unsupervised methods and is competitive with fully supervised approaches on the VOC datasets, while showing sensitivity to dataset scale and prompting cues. In practice, the approach enables high-performance multi-label classification without manual annotations.
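To make the debiasing step concrete, here is a minimal sketch of one simple way to reduce per-class bias in soft pseudo-labels: subtract each class's dataset-wide mean score and clip at zero. This is an illustrative assumption, not the rule used by CCD; the function name `debias_scores` and the mean-subtraction scheme are hypothetical.

```python
from typing import List


def debias_scores(scores: List[List[float]]) -> List[List[float]]:
    """Subtract each class's mean score over the dataset, clipping at 0.

    scores[i][c] is the soft pseudo-label of class c for image i.
    ASSUMPTION: mean subtraction is a stand-in for illustration only;
    it is not taken from the paper.
    """
    num_images = len(scores)
    num_classes = len(scores[0])
    # Per-class average score: a class that scores high on every image
    # is treated as biased and is pushed down accordingly.
    class_mean = [
        sum(row[c] for row in scores) / num_images for c in range(num_classes)
    ]
    return [
        [max(0.0, row[c] - class_mean[c]) for c in range(num_classes)]
        for row in scores
    ]


# Two images, two classes; class 1 scores 0.25 everywhere, so it is removed.
example = [[0.75, 0.25], [0.25, 0.25]]
print(debias_scores(example))  # [[0.25, 0.0], [0.0, 0.0]]
```

A class that receives a uniformly high score across the dataset carries little information about any single image, which is why this sketch suppresses it.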
Abstract
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study proposes unsupervised multi-label classification that leverages CLIP, a powerful vision-language model. Despite its proficiency, CLIP suffers from view-dependent predictions and inherent bias, which limit its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by the Class Activation Mapping (CAM) of the classifier, and by debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) selects multiple local views without extra labels and debiases predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.
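The idea of selecting a local view near a target object from a CAM can be sketched as follows. This is a simplified illustration under stated assumptions, not the CCD implementation: the helper names `cam_to_box` and `crop`, the relative threshold of 0.5, and the single-box strategy are all hypothetical. The CAM and the image are plain nested lists so the example runs without dependencies.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def cam_to_box(cam: List[List[float]], rel_threshold: float = 0.5) -> Optional[Box]:
    """Bounding box of all CAM cells whose value is >= rel_threshold * max.

    ASSUMPTION: thresholding relative to the peak activation is a common
    heuristic used here for illustration; the paper may select views differently.
    """
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return None  # no activation, so no local view can be proposed
    cutoff = rel_threshold * peak
    rows = [r for r, row in enumerate(cam) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(cam[0])) if any(row[c] >= cutoff for row in cam)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the local view described by an inclusive box out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


cam = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.0],
    [0.0, 0.6, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
image = [[r * 4 + c for c in range(4)] for r in range(4)]
box = cam_to_box(cam)
print(box)               # (1, 1, 2, 2)
print(crop(image, box))  # [[5, 6], [9, 10]]
```

In a full pipeline, the cropped view would be passed to CLIP to obtain a local prediction for the corresponding class; that step is omitted here because it depends on the model interface.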
