Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, Xuguang Lan

TL;DR

This work tackles the slow convergence of DETR by introducing an explicit position relation prior as an attention bias. It proposes Relation-DETR, which combines a position relation encoder, progressive attention refinement, and a contrast relation pipeline that reconciles non-duplicate predictions with positive supervision in an end-to-end DETR framework. On COCO val2017, under the same configurations, it reports 51.7% AP (1x) and 52.1% AP (2x), a +2.0% AP gain over DINO, and more than 40% AP after only 2 training epochs. It also shows gains on task-specific datasets, and the relation encoder transfers to other DETR variants as a plug-in component. On the accompanying class-agnostic SA-Det-100k dataset, the explicit position relation improves AP by 1.3%, which the authors take as evidence of its potential for universal object detection.

Abstract

This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from self-attention, which introduces no structural bias over inputs. To address this issue, we explore incorporating a position relation prior as attention bias to augment object detection, after verifying its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement, and further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-and-play component, bringing clear improvements for, in theory, any DETR-like method. Furthermore, we introduce a class-agnostic detection dataset, SA-Det-100k. The experimental results on this dataset show that the proposed explicit position relation achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection. The code and dataset are available at https://github.com/xiuqhou/Relation-DETR.
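
To make the idea of a position relation prior used as attention bias concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the log-scaled pairwise box features, the fixed `weights` that collapse them into one scalar, and all function names are assumptions chosen for illustration, and the abstract does not specify the actual relation encoder. The sketch only shows the general mechanism, in which a pairwise term computed from box geometry is added to the attention logits before the softmax.

```python
import math

def relative_box_features(boxes):
    """Pairwise geometry features for boxes given as (cx, cy, w, h).

    Assumed form for illustration: log-scaled center offsets and size ratios.
    """
    feats = []
    for (xi, yi, wi, hi) in boxes:
        row = []
        for (xj, yj, wj, hj) in boxes:
            row.append((
                math.log(abs(xi - xj) / wi + 1.0),  # horizontal offset
                math.log(abs(yi - yj) / hi + 1.0),  # vertical offset
                math.log(wi / wj),                  # width ratio
                math.log(hi / hj),                  # height ratio
            ))
        feats.append(row)
    return feats

def relation_bias(boxes, weights=(-1.0, -1.0, -0.5, -0.5)):
    """Hypothetical scalar bias: a fixed weighted sum of feature magnitudes.

    A learned encoder would replace these hand-picked weights.
    """
    feats = relative_box_features(boxes)
    return [[sum(w * abs(f) for w, f in zip(weights, pair)) for pair in row]
            for row in feats]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(queries, keys, bias):
    """Scaled dot-product attention weights with an additive pairwise bias."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias[i][j]
                  for j, k in enumerate(keys)]
        out.append(softmax(logits))
    return out

# Three boxes: two close neighbours and one far away.
boxes = [(10.0, 10.0, 4.0, 4.0), (11.0, 10.0, 4.0, 4.0), (100.0, 100.0, 4.0, 4.0)]
# Identical query/key vectors, so any difference in attention comes from the bias.
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
attn = biased_attention_weights(tokens, tokens, relation_bias(boxes))
```

With identical content vectors, unbiased attention would be uniform. Here the bias alone makes the first box attend more strongly to its near neighbour than to the distant box, which is the kind of structural prior the abstract refers to.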

Paper Structure (21 sections, 8 equations, 6 figures, 8 tables)

This paper contains 21 sections, 8 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Statistical distribution of macroscopic correlation (MC) on various datasets (normalized for better visualization). The values in brackets indicate the number of samples in each dataset.
  • Figure 2: Comparison of the transformer decoder in Deformable-DETR (left) and Relation-DETR (right).
  • Figure 3: Detailed illustration of the proposed contrast relation pipeline.
  • Figure 4: Convergence curve (left) and precision-recall curve for IoU=$50\%\sim95\%$ (right). All models are trained with a ResNet-50 backbone under the same 1$\times$ training configuration on COCO 2017.
  • Figure 5: Representative objects (red) related to the given object (blue).
  • ...and 1 more figure