Context-Based Semantic-Aware Alignment for Semi-Supervised Multi-Label Learning

Heng-Bo Fan, Ming-Kun Xie, Jia-Hao Xiao, Sheng-Jun Huang

TL;DR

This work tackles semi-supervised multi-label learning (SSMLL) where labeled data are scarce by leveraging vision-language models, notably CLIP, to improve pseudo-labels for unlabeled images. The authors propose a context-based semantic-aware alignment framework that extracts label-specific image features and aligns them with class text embeddings, turning a complex many-to-one problem into manageable one-to-one alignments. A semi-supervised context identification auxiliary task captures label co-occurrence by partitioning the label space into contexts via spectral clustering and learning with labeled and unlabeled data. Across COCO, VOC, and NUS-WIDE, the method achieves state-of-the-art results, especially at low supervision levels, demonstrating the effectiveness of context modeling and label-specific alignment for SSMLL and highlighting the practical impact of leveraging VLMs for this challenging setting.

Abstract

Due to the lack of extensive, precisely annotated multi-label data in the real world, semi-supervised multi-label learning (SSMLL) has gradually gained attention. Abundant knowledge embedded in vision-language models (VLMs) pre-trained on large-scale image-text pairs could alleviate the challenge of limited labeled data under the SSMLL setting. Although existing methods based on fine-tuning VLMs have achieved advances in weakly-supervised multi-label learning, they fail to fully leverage the information from labeled data to enhance learning on unlabeled data. In this paper, we propose a context-based semantic-aware alignment method to solve the SSMLL problem by leveraging the knowledge of VLMs. To address the challenge of handling multiple semantics within an image, we introduce a novel framework design to extract label-specific image features. This design allows us to achieve a more compact alignment between text features and label-specific image features, leading the model to generate high-quality pseudo-labels. To equip the model with a comprehensive understanding of the image, we design a semi-supervised context identification auxiliary task that enhances the feature representation by capturing co-occurrence information. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our proposed method.
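
The one-to-one alignment described in the abstract can be illustrated with a minimal sketch. It assumes that one image feature per class is already available and scores each class by comparing that feature only with the text feature of the same class. How the label-specific features are actually extracted is not described here, and the function names, the sigmoid, and the `scale` value are illustrative assumptions rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_to_one_scores(label_feats, text_feats, scale=10.0):
    """Score class c from its own image feature and its own text feature.

    label_feats[c] and text_feats[c] are the (hypothetical) label-specific
    image feature and the class text embedding for class c. The sigmoid and
    the scale factor are assumptions made for this sketch.
    """
    scores = []
    for img_c, txt_c in zip(label_feats, text_feats):
        sim = cosine(img_c, txt_c)
        scores.append(1.0 / (1.0 + math.exp(-scale * sim)))
    return scores

# Toy example with two classes in a 2-D feature space.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
label_feats = [[2.0, 0.0], [0.0, -3.0]]  # class 0 aligned, class 1 opposed
scores = one_to_one_scores(label_feats, text_feats)
print(scores)
```

Because each class is compared only with its own feature pair, a strong match for one class does not affect the score of another, which is the contrast with aligning all text embeddings against a single whole-image embedding.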

Paper Structure

This paper contains 31 sections, 12 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: In MLL, context information determines the co-occurrence relationships among categories. Semantics belonging to the same context are more likely to appear simultaneously. To leverage this, we introduce an auxiliary task, Context Identification, during model training. This task helps narrow the label space down to its relevant subset.
  • Figure 2: An overview of our proposed method. We begin by employing CLIP to extract both image features and $C$-class text features. Subsequently, we thoroughly leverage the pre-training knowledge embedded in the text encoder of CLIP to extract label-specific image features. To further exploit label correlation, we introduce a Context Identification task. Finally, we optimize the network by generating pseudo-labels with class-wise thresholds.
  • Figure 3: A comparison between the previous alignment task and ours. (a) Previous methods align multiple text embeddings with one whole-image embedding. Instead, (b) our method extracts label-specific image features and aligns them with the corresponding class text features.
  • Figure 4: Visualization of the images with the highest prediction probability within various contexts.
  • Figure 5: Context clusters generated by spectral clustering based on the correlation matrix of COCO.
  • ...and 2 more figures
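
The context clusters mentioned for Figure 5 come from spectral clustering on a label correlation matrix. As a rough illustration of that idea, the sketch below builds a co-occurrence matrix from a toy label set and splits the classes into two contexts using the sign of the Fiedler vector of the graph Laplacian, which is the two-cluster special case of spectral clustering. Using raw co-occurrence counts as the affinity, restricting to two contexts, and the names `cooccurrence_counts` and `two_contexts` are all assumptions made for this sketch; the paper's exact correlation matrix and number of contexts are not given here.

```python
import math

def cooccurrence_counts(label_rows, num_classes):
    """Count how often each pair of distinct classes is positive in one image."""
    counts = [[0.0] * num_classes for _ in range(num_classes)]
    for row in label_rows:
        present = [c for c in range(num_classes) if row[c] == 1]
        for a in present:
            for b in present:
                if a != b:
                    counts[a][b] += 1.0
    return counts

def two_contexts(affinity, iters=500):
    """Split classes into two contexts by the sign of the Fiedler vector.

    Runs power iteration on (shift * I - L), where L = D - W is the
    unnormalized graph Laplacian, while projecting out the constant vector.
    """
    n = len(affinity)
    degree = [sum(row) for row in affinity]
    # Laplacian eigenvalues are at most 2 * max degree, so the shifted
    # matrix has only positive eigenvalues.
    shift = 2.0 * max(degree) + 1.0
    v = [i - (n - 1) / 2.0 for i in range(n)]  # zero-mean start vector
    for _ in range(iters):
        w = []
        for i in range(n):
            neighbours = sum(affinity[i][j] * v[j] for j in range(n))
            w.append((shift - degree[i]) * v[i] + neighbours)
        mean = sum(w) / n
        w = [x - mean for x in w]  # remove the constant eigenvector
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [1 if x > 0 else 0 for x in v]

# Toy labels: classes 0-2 usually co-occur, classes 3-4 usually co-occur,
# and a single image links class 2 with class 3.
toy_labels = (
    [[1, 1, 1, 0, 0]] * 3
    + [[1, 1, 0, 0, 0]]
    + [[0, 0, 0, 1, 1]] * 3
    + [[0, 0, 1, 1, 0]]
)
contexts = two_contexts(cooccurrence_counts(toy_labels, 5))
print(contexts)
```

On this toy data, classes 0 to 2 end up in one context and classes 3 and 4 in the other, since the single image linking classes 2 and 3 is the weakest connection in the graph. For more than two contexts, a library routine such as scikit-learn's `SpectralClustering` with a precomputed affinity would be the more usual choice.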