Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

Jiexuan Yan, Sheng Huang, Nankun Mu, Luwen Huangfu, Bo Liu

TL;DR

The paper tackles long-tailed multi-label image classification by introducing CPRFL, a CLIP-assisted prompt-learning framework that initializes category-prompts from text embeddings and uses a Transformer-based Visual-Semantic Interaction to decouple category-specific visual representations. A progressive Dual-Path Back-Propagation mechanism refines prompts and purifies representations, mitigating visual-semantic domain bias, while an Asymmetric Loss addresses negative-positive imbalance across all classes. The approach yields state-of-the-art results on VOC-LT and COCO-LT, with substantial gains for tail classes and robust improvements across head and medium classes as well. The method demonstrates the value of leveraging semantic correlations between categories to enhance LTMLC performance, and its code is publicly available.

Abstract

Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at https://github.com/jiexuanyan/CPRFL.
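
The abstract names the Asymmetric Loss as the optimization objective but does not spell it out. The following is a minimal sketch of the general Asymmetric Loss formulation, not the authors' implementation: positives are weighted by $(1-p)^{\gamma_+}$, while negatives use a margin-shifted probability $p_m = \max(p - m, 0)$ weighted by $p_m^{\gamma_-}$. The function name `asymmetric_loss` and the default values for `gamma_pos`, `gamma_neg`, and `margin` are assumptions chosen for illustration; the settings used by CPRFL are not stated in this summary.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-class asymmetric losses for one image.

    probs   -- predicted probability per category, each in [0, 1]
    targets -- 1 if the category is present in the image, else 0
    The default hyperparameters are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style weight (1 - p)^gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so
            # that very easy negatives contribute exactly zero loss.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total
```

With these assumed defaults, a negative whose predicted probability falls below the margin is ignored entirely, and a negative at the same probability as a positive is penalized far less. This is how the loss suppresses the large number of negative samples that each image contributes across all classes.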

Paper Structure (17 sections, 7 equations, 4 figures, 3 tables)

This paper contains 17 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall framework of our CPRFL for long-tailed multi-label image classification. Our approach consists of two sub-networks: the Prompt Initialization (PI) network and the Visual-Semantic Interaction (VSI) network. The initial prompts $P$ are extracted from CLIP's text embeddings within the PI network; these prompts then interact with the visual features $F$ within the VSI network, which decouples the category-specific visual representations $P'$. Finally, we compute the similarities between the category-specific features $P'$ and their corresponding prompts $P$ to obtain the prediction probability for each category, and we use a progressive Dual-Path Back-Propagation mechanism to refine the prompts. To further address the negative-positive imbalance problem inherent across multiple categories, we incorporate a Re-Weighting (RW) strategy.
  • Figure 2: The refined learning process for category-prompts and category-specific features. We employ a progressive Dual-Path Back-Propagation mechanism to refine the prompts and progressively purify the category-specific visual representations over the training iterations. The depth of color represents the accuracy of the features, and the darker the color, the higher the accuracy.
  • Figure 3: The mAP (%) performance with various types of category semantics for prompt initialization on the COCO-LT dataset.
  • Figure 4: Visualization examples of the Top-3 predicted categories by ResNet-50, CLIP, and our CPRFL.
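
The prediction step described in the Figure 1 caption, where each category-specific feature in $P'$ is compared with its corresponding prompt in $P$, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's exact formulation: the caption only says "similarities", so the use of cosine similarity, the sigmoid, and the `scale` factor are assumptions, and the names `cosine` and `predict_probs` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def predict_probs(features, prompts, scale=10.0):
    """Per-category probability from feature/prompt similarity.

    features -- one category-specific vector per category (P' in Figure 1)
    prompts  -- one prompt vector per category (P in Figure 1)
    scale    -- assumed sharpening factor applied before the sigmoid
    """
    probs = []
    for f, p in zip(features, prompts):
        logit = scale * cosine(f, p)
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs
```

Because each category gets its own independent sigmoid score, several categories can be predicted as present in the same image, which is what the multi-label setting requires.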