Semantic-Aware Representation Learning via Conditional Transport for Multi-Label Image Classification
Ren-Dong Xie, Zhi-Fen He, Bo Li, Bin Liu, Jin-Yan Hu
TL;DR
Multi-label image classification often struggles with learning discriminative semantic-aware features and aligning visual representations with label embeddings. SCT addresses this by fusing global image features with label embeddings to form semantic-related features $F^{S}$, and then refining them via bidirectional conditional transport guided by a learnable semantic map $M$ to produce region-level representations $F^{R}$. The approach optimizes a joint objective $L = L_{\text{cls}} + \lambda_{1} L_{m} + \lambda_{2} L_{\text{CT}}$, achieving superior mAP on VOC2007 and competitive results on MS-COCO against state-of-the-art methods. By enabling label-specific representation learning and efficient visual-semantic alignment, SCT advances multi-label recognition and shows promise for extensions to few-shot and zero-shot scenarios.
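To make the joint objective more concrete, the following is a minimal NumPy sketch of how a bidirectional conditional transport cost between region features and label embeddings could be computed and combined with the other loss terms. It is an illustration under stated assumptions, not the authors' implementation: the cosine-based cost, the softmax transport plans, the function names `bidirectional_ct_loss` and `joint_objective`, and the placeholder loss values are all hypothetical, and the learnable semantic map $M$ is omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bidirectional_ct_loss(regions, labels):
    """Illustrative bidirectional conditional transport cost.

    regions: (n, d) array of region-level visual features (stand-in for F^R).
    labels:  (c, d) array of label embeddings.
    Assumption: cost is 1 - cosine similarity, and each transport plan is a
    softmax over similarities. The paper's exact form may differ.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    l = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    sim = r @ l.T                 # (n, c) cosine similarities
    cost = 1.0 - sim              # (n, c) point-to-point transport cost

    # Forward plan: each region spreads unit mass over the labels.
    forward = softmax(sim, axis=1)
    # Backward plan: each label spreads unit mass over the regions.
    backward = softmax(sim, axis=0)

    forward_cost = (forward * cost).sum(axis=1).mean()
    backward_cost = (backward * cost).sum(axis=0).mean()
    return forward_cost + backward_cost


def joint_objective(l_cls, l_m, l_ct, lambda1, lambda2):
    """L = L_cls + lambda1 * L_m + lambda2 * L_CT, as stated in the TL;DR."""
    return l_cls + lambda1 * l_m + lambda2 * l_ct


# Toy inputs: two regions and two labels in a 4-dimensional space.
aligned_regions = np.eye(4)[:2]      # e1, e2
aligned_labels = np.eye(4)[:2]       # e1, e2 (match the regions)
orthogonal_labels = np.eye(4)[2:]    # e3, e4 (orthogonal to the regions)

ct_aligned = bidirectional_ct_loss(aligned_regions, aligned_labels)
ct_orthogonal = bidirectional_ct_loss(aligned_regions, orthogonal_labels)

# Placeholder values for L_cls and L_m; the weights are arbitrary here.
total = joint_objective(l_cls=1.0, l_m=2.0, l_ct=3.0, lambda1=0.5, lambda2=0.1)
```

In this toy setting the transport cost is lower when the region features coincide with the label embeddings than when they are orthogonal, which is the behavior an alignment term of this kind is meant to encourage.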
Abstract
Multi-label image classification is a critical task in machine learning that aims to accurately assign multiple labels to a single image. Existing methods often use attention mechanisms or graph convolutional networks to model visual representations, but their performance is still constrained by two limitations: the inability to learn discriminative semantic-aware features, and the lack of fine-grained alignment between visual representations and label embeddings. To tackle these issues in a unified framework, this paper proposes a novel approach named Semantic-aware representation learning via Conditional Transport for Multi-Label Image Classification (SCT). The proposed method introduces a semantic-related feature learning module that extracts discriminative label-specific features by emphasizing semantic relevance and interaction, along with a conditional transport-based alignment mechanism that enables precise visual-semantic alignment. Extensive experiments on two widely used benchmark datasets, VOC2007 and MS-COCO, validate the effectiveness of SCT and demonstrate its superior performance compared to existing state-of-the-art methods.
