CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification

Rabab Abdelfattah, Qing Guo, Xiaoguang Li, Xiaofeng Wang, Song Wang

TL;DR

CDUL tackles annotation-free multi-label image classification by leveraging CLIP to generate both global image and local snippet representations, then fusing these signals through a global-local aggregator to produce soft pseudo-labels. A gradient-alignment training procedure iteratively refines both the multi-label classifier and the pseudo-labels, enabling learning from unlabeled data without ground-truth annotations. The approach outperforms prior unsupervised methods and nears weakly supervised performance across VOC2012, VOC2007, COCO, and NUS-WIDE, demonstrating the value of explicit local-alignment signals and iterative pseudo-label refinement for multi-label semantics. The method is practical and cost-efficient, since CLIP is used only for initialization and inference relies on a lightweight classifier, making it suitable for scalable deployment on large image collections.

Abstract

This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation. To be more specific, we split each image into snippets and leverage CLIP to generate the similarity vector for the whole image (global) as well as each snippet (local). Then a similarity aggregator is introduced to leverage the global and local similarity vectors. Using the aggregated similarity scores as the initial pseudo labels at the training stage, we propose an optimization framework to train the parameters of the classification network and refine pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets and even achieves comparable results to weakly supervised classification methods.
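
The initialization stage described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. Random vectors stand in for CLIP image and text embeddings, and the function names (`similarity_vector`, `aggregate`), the softmax temperature, the threshold value, and the specific aggregation rule (take the per-class maximum over snippets when it is confident, otherwise the minimum, then average with the global score) are assumptions chosen for illustration; the abstract only states that an aggregator combines the global and local similarity vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def similarity_vector(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between image and class-text embeddings.

    image_emb: (D,) for a whole image or (M, D) for M snippets.
    text_embs: (C, D), one embedding per class prompt.
    The temperature is an assumed value, not taken from the paper.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return softmax(image_emb @ text_embs.T / temperature, axis=-1)


def aggregate(s_global, s_local, threshold=0.5):
    """Combine a global (C,) vector with local (M, C) vectors into soft pseudo labels.

    Assumed rule: per class, keep the strongest snippet score if it reaches
    the threshold, else the weakest one, then average with the global score.
    """
    alpha = s_local.max(axis=0)  # strongest snippet evidence per class
    beta = s_local.min(axis=0)   # weakest snippet evidence per class
    s_aggregate = np.where(alpha >= threshold, alpha, beta)
    return 0.5 * (s_global + s_aggregate)


# Stand-ins for CLIP outputs: 5 classes, 16-dim embeddings, 9 snippets.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(5, 16))
global_emb = rng.normal(size=16)
snippet_embs = rng.normal(size=(9, 16))

s_global = similarity_vector(global_emb, text_embs)   # shape (5,)
s_local = similarity_vector(snippet_embs, text_embs)  # shape (9, 5)
pseudo = aggregate(s_global, s_local)                 # shape (5,)
print(pseudo)
```

With real data, `global_emb` and `snippet_embs` would come from the CLIP image encoder applied to the whole image and to each crop, and `text_embs` from the CLIP text encoder applied to one prompt per class.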

Paper Structure (19 sections, 15 equations, 7 figures, 5 tables)

This paper contains 19 sections, 15 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: A comparison of our solution with fully and weakly supervised multi-label classification. (a) The training images for fully supervised learning are fully labeled. (b) The training images used in weakly supervised learning are partially labeled. (c) Our unsupervised multi-label classification method is annotation-free. (d) CLIP focuses on one class in the whole image, and the embedding is denoted by a blue circle. Some classes, such as "person", are ignored. (e) In our approach, image snippets are mapped separately to the embedding space, where each snippet's embedding is denoted by a square. Local alignment allows more labels to be predicted.
  • Figure 2: Confidence scores from the off-the-shelf CLIP on sample images from the COCO dataset.
  • Figure 3: The overall framework for CDUL unsupervised multi-label image classification. (a) During initialization, we propose CLIP-driven global and local alignment and aggregation to generate pseudo labels. ($i$) Given an image, CLIP predicts the global similarity vector $S^{global}$; ($ii$) Given the snippets of this image, CLIP predicts local similarity vectors $S_j^{local}$; ($iii$) The global-local aggregator is used to generate the pseudo labels $S^{final}$. (b) During training, the pseudo labels generated during initialization are used to supervise the training of the classification network, using our proposed gradient-alignment method. (c) The gradient-alignment illustration shows that updating the network parameters and the pseudo labels by turns pushes both the pseudo label $y_u$ and the predicted label $y_p$ toward the optimal solution that minimizes the total loss function. During inference, we apply the whole image to the classification network to get the multi-label predictions.
  • Figure 4: The distributions of the predicted labels across the confidence scores using off-the-shelf CLIP on the whole image (global) and snippets (local).
  • Figure 5: Class activation maps for several examples, corresponding to the highest confidences for three labels on the COCO dataset. The highlighted area indicates where the model focused to classify the image. Best viewed in color.
  • ...and 2 more figures
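
The Figure 3 caption describes training as updating the network parameters and the pseudo labels by turns. The toy sketch below shows that alternating pattern only. It is not the paper's gradient-alignment procedure. A linear layer with a sigmoid stands in for the classification network, a squared-error loss is swapped in for the paper's loss (which this excerpt does not specify), and the names `alternate_updates`, `lr_w`, and `lr_y` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_fn(y_p, y_u):
    """Half the mean squared gap between predictions y_p and pseudo labels y_u."""
    return 0.5 * float(np.mean((y_p - y_u) ** 2))


def alternate_updates(features, init_pseudo, epochs=100, lr_w=0.5, lr_y=0.05):
    """Alternate gradient steps on classifier weights and on pseudo labels.

    features: (N, D) inputs; init_pseudo: (N, C) soft pseudo labels in [0, 1].
    Returns the weights, the refined pseudo labels, and the loss per epoch.
    """
    w = np.zeros((features.shape[1], init_pseudo.shape[1]))
    y_u = init_pseudo.copy()
    history = []
    for _ in range(epochs):
        # Turn 1: pseudo labels fixed, gradient step on the classifier weights.
        y_p = sigmoid(features @ w)
        grad_w = features.T @ ((y_p - y_u) * y_p * (1.0 - y_p)) / y_u.size
        w -= lr_w * grad_w

        # Turn 2: classifier fixed, step each pseudo label along the negative
        # gradient of its per-entry squared error, (y_u - y_p).
        y_p = sigmoid(features @ w)
        y_u = np.clip(y_u - lr_y * (y_u - y_p), 0.0, 1.0)

        history.append(loss_fn(y_p, y_u))
    return w, y_u, history


# Synthetic data: noisy soft labels play the role of CLIP-derived pseudo labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
true_w = rng.normal(size=(8, 3))
clean = sigmoid(X @ true_w)
noisy = np.clip(clean + 0.2 * rng.normal(size=clean.shape), 0.0, 1.0)

w, y_u, history = alternate_updates(X, noisy)
print(history[0], history[-1])
```

The small `lr_y` is an illustrative choice that lets the pseudo labels drift only slowly away from their initialization while the classifier catches up.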