Text-Region Matching for Multi-Label Image Recognition with Missing Labels

Leilei Ma, Hongxing Xie, Lei Wang, Yanping Fu, Dengdi Sun, Haifeng Zhao

TL;DR

This work tackles multi-label image recognition with missing labels by introducing TRM-ML, a framework that enforces one-to-one text-region matching between category prompts and category-aware visual regions. It combines category-aware region learning, knowledge distillation for cross-modal alignment, multimodal category prototypes for pseudo-labels, and multimodal contrastive learning to tighten intra-class and inter-class relationships across modalities. Key contributions include transforming one-to-many or one-to-agnostic matching into robust text-region matching, designing a multimodal prototype system for pseudo-label generation, and integrating contrastive learning to bridge semantic gaps; together with strong experimental results on MS-COCO, VOC2007, VG-200, and others, TRM-ML achieves state-of-the-art performance under partial-label conditions. The approach yields practical benefits for large-scale, label-frugal MLIR tasks, enabling more reliable text-vision alignment and improved recognition with incomplete annotations.

Abstract

Recently, large-scale vision-language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, these methods usually cannot match text and vision features well, owing to the complicated semantic gaps and missing labels in a multi-label image. To tackle this challenge, we propose $\textbf{T}$ext-$\textbf{R}$egion $\textbf{M}$atching for optimizing $\textbf{M}$ulti-$\textbf{L}$abel prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or individual pixels, which helps bridge the semantic gap between textual and visual representations in a one-to-one matching manner. In addition, we introduce multimodal contrastive learning to narrow the semantic gap between textual and visual modalities and to establish intra-class and inter-class relationships. Finally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art methods by a significant margin. Our code is available here: https://github.com/yu-gi-oh-leilei/TRM-ML.
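The one-to-one text-region matching described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`category_regions`, `text_region_logits`), the single-head dot-product attention, the temperature of 0.07, and the random features standing in for frozen encoder outputs are all assumptions made for illustration. Each category query attends over the image patch features to pool one category-aware region, and each category's text embedding is then compared only with its own region.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def category_regions(queries, patches):
    """Pool one region feature per category (hypothetical helper).

    queries: (C, D) category queries (learnable in a real model).
    patches: (N, D) patch features from a frozen visual encoder.
    Returns region features (C, D) and attention weights (C, N).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)
    return attn @ patches, attn

def text_region_logits(text_emb, regions, temperature=0.07):
    """One-to-one matching: text embedding c is scored against region c only."""
    t = l2_normalize(text_emb)
    r = l2_normalize(regions)
    return np.sum(t * r, axis=-1) / temperature  # shape (C,)

# Random stand-ins for encoder outputs (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
C, N, D = 5, 49, 16
queries = rng.normal(size=(C, D))
patches = rng.normal(size=(N, D))
text_emb = rng.normal(size=(C, D))

regions, attn = category_regions(queries, patches)
logits = text_region_logits(text_emb, regions)
probs = 1.0 / (1.0 + np.exp(-logits))  # independent per-category scores
```

Because the score for category c depends only on the pair (text c, region c), the output is a vector of C independent scores rather than a C-by-N or C-by-C similarity matrix, which is what distinguishes this setup from text-pixel or text-image matching.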

Paper Structure (16 sections, 15 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of different matching methods for textual and visual representations, including text-pixel, text-image, text-text, and our proposed text-region matching. Unlike other methods, our approach establishes a one-to-one correspondence between textual and regional visual representations.
  • Figure 2: Overview of the TRM-ML framework. The visual encoder and text encoder are frozen during the training phase; only the category query, the cross-attention, and a simple MLP are trained.
  • Figure 3: An overview of the pseudo-label generation process. To illustrate pseudo-label estimation simply and clearly, we select one of the four color samples as an example.
  • Figure 4: Visual analysis of DualCoOp and the proposed method. For each subfigure, we present the categories with the top-3 prediction scores; the first row is from DualCoOp and the second row is generated by the category-aware region learning module.