Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation

Yingying Zhang, Chuangji Shi, Xin Guo, Jiangwei Lao, Jian Wang, Jiaotuan Wang, Jingdong Chen

TL;DR

This work targets the content-query deficiency in DETR-like detectors and introduces Self-Adaptive Content Query (SACQ), which leverages self-attention pooling over encoder features to initialize and refine content queries. It further proposes Query Aggregation (QA) to merge similar high-quality predictions, addressing instability in one-to-one Hungarian matching. Together, SACQ and QA deliver consistent improvements across six DETR variants on the COCO dataset, achieving an average AP gain exceeding 1.0. The approach is plug-and-play, does not rely on altering the positional query, and demonstrates robust gains while shedding light on the role of content priors in cross-attention for object localization. These results suggest a practical path to more accurate DETR-based detectors in real-world applications.

Abstract

The design of the query is crucial for the performance of DETR and its variants. Each query consists of two components: a content part and a positional part. Traditionally, the content query is initialized with a zero or learnable embedding, which lacks essential content information and results in sub-optimal performance. In this paper, we introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ), to address this limitation. The SACQ module uses features from the transformer encoder to generate content queries via self-attention pooling. This allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects. However, this improved concentration poses a challenge for training with Hungarian matching, which selects only a single candidate and suppresses other similar ones. To overcome this, we propose a query aggregation strategy that works together with SACQ. It merges similar predicted candidates from different queries, easing the optimization. Our extensive experiments on the COCO dataset demonstrate the effectiveness of the proposed approaches across six different DETR variants with multiple configurations, achieving an average improvement of over 1.0 AP.
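
To make the idea of self-attention pooling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the attention maps come from a simple linear projection of each encoder feature (the `proj` weights are hypothetical) and that the maps are normalized with a softmax over spatial positions, so the pooled vector is a weighted average of the encoder features. The function names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(features, proj):
    """Pool N encoder features (each of dim d) into q content vectors.

    features: list of N lists of length d (flattened spatial positions).
    proj:     list of q lists of length d; an assumed linear projection
              that scores every position once per attention map.
    Returns a list of q pooled vectors, each of length d.
    """
    d = len(features[0])
    pooled = []
    for w in proj:
        # One logit per spatial position for this attention map.
        logits = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
        # Normalize over the spatial dimension (assumption: softmax).
        attn = softmax(logits)
        # Weighted average of the features under this attention map.
        pooled.append(
            [sum(a * f[k] for a, f in zip(attn, features)) for k in range(d)]
        )
    return pooled


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]          # N = 2 positions, d = 2
    queries = self_attention_pool(feats, [[0.0, 0.0], [10.0, 0.0]])
    print(queries)  # one pooled vector per attention map
```

With an all-zero projection the attention is uniform and the pooled vector is the plain mean of the features; a projection that scores one position highly pulls the pooled vector toward that position's feature, which is the sense in which the content query adapts to the input image.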

Paper Structure (30 sections, 5 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison of the multi-scale deformable attention in the first decoder layer between vanilla Deformable-DETR and Deformable-DETR with SACQ. We draw the sampling points and attention weights from the multi-scale feature maps in one picture. Each sampling point is marked as a filled circle whose color indicates its attention weight. The red rectangle is the predicted bounding box of the corresponding query.
  • Figure 2: The left portion of the diagram depicts the structure of the proposed SAPM. Features from the transformer encoder are projected into $q$ attention maps through the attention map projection modules. For each feature from the encoder, its elements are weighted in the spatial dimension according to one of the attention maps and then averaged to create a spatially pooled feature. The right portion illustrates the integration of SACQ into the transformer decoder of DETR variants. SACQ generates the initial content query for the first layer of the decoder from the features produced by the transformer encoder. Starting from the second layer of the decoder, SACQ uses SAPM to enhance the content query based on the previous box prediction.
  • Figure 3: (a) shows the vanilla transformer decoder. The candidate predictions generated from queries are directly matched with targets. (b) shows the decoder with our query aggregation strategy. Candidate predictions are first merged according to a similarity metric and then matched with targets.
  • Figure 4: The attention maps from the SACQ module are visualized alongside the corresponding detected objects, each enclosed in a red bounding box. These maps focus sharply on the predicted object, indicating their efficacy in extracting features that are relevant to the target. The ability to concentrate precisely on specific objects confirms that the generated features are suitable for initializing content queries.
  • Figure 5: Visualization of an activated query's bounding box (green box) and the highly overlapping (IoU $> 0.8$) bounding boxes (red boxes) from queries whose low scores are suppressed. Improved content query initialization from SACQ generates more potential queries with similar bounding boxes, which can be further addressed by QA.
  • ...and 3 more figures
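
The merge-then-match idea in Figure 3(b) can be illustrated with a small sketch. This is only one plausible reading, not the authors' method: the captions do not specify the similarity metric or the merge rule, so the code below assumes IoU as the similarity metric, borrows the 0.8 threshold from the Figure 5 visualization, and merges each group into a score-weighted average box that keeps the group's highest score. The names `iou` and `aggregate_queries` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def aggregate_queries(boxes, scores, iou_thresh=0.8):
    """Greedily merge similar candidate predictions before matching.

    Assumptions (not from the paper): similarity is IoU, groups are
    seeded by the highest-scoring unmerged candidate, and the merged
    box is the score-weighted mean of the group.
    Returns a list of (merged_box, merged_score) tuples.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = set()
    merged = []
    for i in order:
        if i in used:
            continue
        # The seed always joins its own group, even if its box is degenerate.
        group = [
            j for j in order
            if j not in used and (j == i or iou(boxes[i], boxes[j]) > iou_thresh)
        ]
        used.update(group)
        total = sum(scores[j] for j in group)
        # Fall back to uniform weights if all scores in the group are zero.
        weights = [scores[j] / total if total > 0 else 1.0 / len(group)
                   for j in group]
        box = [sum(w * boxes[j][k] for w, j in zip(weights, group))
               for k in range(4)]
        merged.append((box, scores[i]))
    return merged


if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [0, 0, 10, 9.5], [20, 20, 30, 30]]
    scores = [0.9, 0.3, 0.8]
    for box, score in aggregate_queries(boxes, scores):
        print(box, score)
```

In a training loop, the merged candidates rather than the raw per-query predictions would then be passed to Hungarian matching, so near-duplicate queries no longer compete for the same target.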