Query2Label: A Simple Transformer Way to Multi-Label Classification

Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, Jun Zhu

TL;DR

Query2Label introduces a simple Transformer-based, two-stage framework for multi-label classification that treats each label as a learnable query and uses cross-attention to pool label-specific features from a backbone. Label embeddings are iteratively updated by multi-head attention to produce per-label representations, followed by a per-label projection to probabilities, trained with an asymmetric loss to handle imbalance. The approach delivers state-of-the-art results on MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome, including a reported 91.3% mAP on MS-COCO, and provides interpretable attention maps that localize label-specific regions. Its backbone-agnostic design and strong empirical performance establish a robust, simple baseline for future multi-label classification work.
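The asymmetric loss mentioned above can be illustrated with a small, dependency-free sketch. It assumes the focusing-parameter form of asymmetric loss, in which positive and negative labels get separate exponents so that easy negatives contribute less. The function name `asymmetric_loss`, the default values of `gamma_pos` and `gamma_neg`, and the `eps` constant are illustrative assumptions, not values taken from this page.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=1.0, eps=1e-8):
    """Mean asymmetric focal-style loss over the K labels of one image.

    probs:   predicted probabilities, one per label
    targets: 0/1 ground-truth indicators, one per label
    gamma_pos, gamma_neg: focusing exponents (illustrative defaults)
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: down-weight by (1 - p) ** gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(p + eps)
        else:
            # Negative label: down-weight easy negatives by p ** gamma_neg.
            total += -(p ** gamma_neg) * math.log(1.0 - p + eps)
    return total / len(probs)


# One image with three labels; only the first label is present.
example_probs = [0.9, 0.2, 0.05]
example_targets = [1, 0, 0]
loss_asymmetric = asymmetric_loss(example_probs, example_targets)
loss_plain_bce = asymmetric_loss(example_probs, example_targets, gamma_neg=0.0)
```

With both exponents set to zero the function reduces to ordinary binary cross-entropy, so `loss_plain_bce` serves as a reference point: the asymmetric version is smaller because the two confidently rejected negatives are down-weighted.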

Abstract

This paper presents a simple and effective approach to solving the multi-label classification problem. The proposed approach leverages Transformer decoders to query the existence of a class label. The use of Transformers is rooted in the need to extract local discriminative features adaptively for different labels, a strongly desired property given that one image often contains multiple objects. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries to probe and pool class-related features from a feature map computed by a vision backbone for subsequent binary classifications. Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective, consistently outperforming all previous works on five multi-label classification data sets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome. In particular, we establish 91.3% mAP on MS-COCO. We hope its compact structure, simple implementation, and superior performance serve as a strong baseline for multi-label classification tasks and future studies. The code will be available soon at https://github.com/SlongLiu/query2labels.
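The query-and-pool idea described in the abstract can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the authors' implementation: it omits the learned query/key/value projections, multiple heads, residual connections, feed-forward layers, and decoder stacking of a real Transformer decoder. All names (`cross_attention_pool`, `predict`, `queries`, `features`) and the random toy values are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(label_queries, spatial_features):
    """Pool one feature vector per label from a flattened feature map.

    label_queries:    K label embeddings, each of dimension d
    spatial_features: HW backbone features, each of dimension d
    Returns (pooled, attention): K pooled vectors and K attention maps.
    """
    d = len(spatial_features[0])
    pooled, attention = [], []
    for q in label_queries:
        # Compare the label embedding with the feature at every location.
        scores = [dot(q, f) / math.sqrt(d) for f in spatial_features]
        weights = softmax(scores)
        # Linearly combine the spatial features using the attention weights.
        feat = [sum(w * f[i] for w, f in zip(weights, spatial_features))
                for i in range(d)]
        pooled.append(feat)
        attention.append(weights)
    return pooled, attention


def predict(pooled, proj_weights, proj_biases):
    """Per-label linear projection followed by a sigmoid."""
    return [1.0 / (1.0 + math.exp(-(dot(w, feat) + b)))
            for feat, w, b in zip(pooled, proj_weights, proj_biases)]


# Toy setup: K = 3 labels, a 2x2 feature map (HW = 4), feature dimension d = 8.
random.seed(0)
K, HW, d = 3, 4, 8
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(HW)]
proj_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
proj_b = [0.0] * K

pooled, attention = cross_attention_pool(queries, features)
probs = predict(pooled, proj_w, proj_b)
```

Each row of `attention` is a distribution over spatial locations for one label, which is the quantity that attention-map visualizations display. Each entry of `probs` is an independent binary prediction for one label.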

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of Query2Label. Cross-attention is used for adaptive feature pooling by focusing on different parts (best viewed in color).
  • Figure 2: The framework of our proposed Query2Label (Q2L). After extracting spatial features of an input image, each label embedding is sent to Transformer decoders to query (by comparing the label embedding with features at each spatial location to generate attention maps) and pool the desired feature adaptively (by linearly combining the spatial features based on the attention maps). The pooled feature is then used to predict the existence of the queried label.
  • Figure 3: Visualization of cross-attention maps. We plot the mean of each head's cross-attention maps, which represent similarities between a given query and the extracted spatial features. Text above each image gives the ground-truth labels (queries) for the raw image. Best viewed in color.
  • Figure 4: Image examples classified correctly by Q2L but wrongly by the baseline TResNetL. The middle two columns are the mean attention maps of Q2L and the enlarged maps of the focused regions, respectively. The small scale of the objects makes them difficult for TResNetL to recognize. Best viewed in color.
  • Figure 5: Visualization of multi-head attention maps for the target label person. Each column in the middle represents an attention map for one head, and the rightmost column averages the maps of all heads. Best viewed in color.
  • ...and 3 more figures