Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Haoxian Ruan, Zhihua Xu, Zhijing Yang, Yongyi Lu, Jinghui Qin, Tianshui Chen

TL;DR

This work addresses semantic confusion in the MLR task by introducing a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework, and shows that the proposed framework significantly outperforms current state-of-the-art methods with a simpler model structure.

Abstract

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is difficult in real application scenarios. Recently, vision-language models (e.g., CLIP) have demonstrated impressive transferability to downstream tasks in data-limited or label-limited settings. However, current CLIP-based methods suffer from semantic confusion in the MLR task because a single global visual and textual representation shared by all categories lacks fine-grained information. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework. Specifically, the semantic decoupling module, which follows the visual encoder, learns category-specific feature maps by using a semantic-guided spatial attention mechanism. Moreover, the category-specific prompt optimization method is introduced to learn text representations aligned with category semantics. As a result, each category is predicted independently, which alleviates the semantic confusion problem. Extensive experiments on the Microsoft COCO 2014 and Pascal VOC 2007 datasets demonstrate that the proposed framework significantly outperforms current state-of-the-art methods with a simpler model structure. Additionally, visual analysis shows that our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.
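
The idea of category-specific features can be illustrated with a minimal sketch. This page does not give the exact form of the semantic-guided spatial attention, so the code below assumes plain dot-product attention with a softmax over spatial locations, followed by a cosine similarity against a per-category text embedding. The function names (`decouple`, `category_scores`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)

def decouple(local_feats, category_queries):
    """Pool N local feature vectors into one vector per category.

    Assumption: attention is a dot product between a category query and
    each spatial location, normalized with a softmax over locations.
    """
    pooled_per_category = []
    for query in category_queries:
        weights = softmax([dot(query, feat) for feat in local_feats])
        pooled = [
            sum(w * feat[d] for w, feat in zip(weights, local_feats))
            for d in range(len(query))
        ]
        pooled_per_category.append(pooled)
    return pooled_per_category

def category_scores(local_feats, category_queries, text_embeds):
    """Score each category from its own visual feature and text embedding."""
    feats = decouple(local_feats, category_queries)
    return [cosine(f, t) for f, t in zip(feats, text_embeds)]

# Toy example: two spatial locations, each showing a different category.
local_feats = [[1.0, 0.0], [0.0, 1.0]]
category_queries = [[5.0, 0.0], [0.0, 5.0]]   # hypothetical semantic queries
text_embeds = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical text embeddings

scores = category_scores(local_feats, category_queries, text_embeds)

# A single global average-pooled feature mixes both categories.
global_feat = [0.5, 0.5]
global_scores = [cosine(global_feat, t) for t in text_embeds]
```

In this toy setup, each category attends mostly to its own location, so the per-category scores are close to 1, while the single global feature scores both categories at about 0.71. That gap is the kind of loss of fine-grained information the abstract attributes to a shared global representation.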

Paper Structure

This paper contains 17 sections, 14 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: An illustration of an MLR image with complete labels (a) and partial labels (b). All positive and negative labels are known in traditional MLR, while some labels are missing in MLR-PL (airplane, clock, traffic light).
  • Figure 2: An overall illustration of the proposed framework. For each class, a prompt is initialized with each component as a learnable vector. These prompts are passed through the CLIP textual encoder to obtain category-specific text embeddings. Meanwhile, the semantic decoupling module employs a semantic-guided spatial attention mechanism to extract fine-grained visual features for each category. The parameters of the semantic decoupling module and the category-specific learnable prompts are optimized by minimizing the classification loss, while the parameters of the CLIP visual and textual encoders remain fixed.
  • Figure 3: Several examples of input images and category activation maps corresponding to the top-3 highest confidence categories predicted by the proposed framework.
  • Figure 4: Comparison of the proposed semantic decoupling module and the semantic attention fusion method (SA) under all label settings on the MS-COCO (left) and Pascal VOC (right) datasets.
  • Figure 5: Per-class average precision (AP) of our proposed framework and the baseline method with a known-label proportion of 10% on the MS-COCO dataset.
  • ...and 4 more figures
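
The captions for Figures 1 and 2 describe training with a classification loss when some labels are missing. This page does not state the exact loss, so the sketch below assumes a common choice for partial-label training: binary cross-entropy averaged over known labels only. The label encoding (1 for positive, -1 for negative, 0 for unknown) and the function name `partial_bce` are assumptions made for illustration.

```python
import math

def partial_bce(probs, labels, eps=1e-12):
    """Binary cross-entropy averaged over known labels only.

    probs:  predicted probability per category, each in (0, 1).
    labels: 1 = known positive, -1 = known negative, 0 = unknown (skipped).
    """
    total, known = 0.0, 0
    for p, y in zip(probs, labels):
        if y == 0:
            continue  # unknown labels contribute nothing to the loss
        known += 1
        if y == 1:
            total += -math.log(p + eps)
        else:
            total += -math.log(1.0 - p + eps)
    return total / known if known else 0.0

# Three categories: one known positive, one known negative, one unknown.
loss = partial_bce([0.9, 0.2, 0.5], [1, -1, 0])

# Changing the prediction for the unknown category leaves the loss unchanged.
loss_other = partial_bce([0.9, 0.2, 0.99], [1, -1, 0])
```

Under this assumption, predictions for unlabeled categories receive no gradient from the loss, so the model is penalized only where ground truth is available.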