Don't let the information slip away

Taozhe Li

TL;DR

We propose Association DETR, an object detection model that achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset. It is built on the premise that background information can significantly aid object detection.

Abstract

Real-time object detection has advanced rapidly in recent years. The YOLO series is among the best-known families of CNN-based object detectors. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, known as DEtection TRansformers (DETR), have demonstrated impressive performance. RT-DETR outperformed the YOLO series in both speed and accuracy when it was released, and its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all of these models let information slip away: they focus primarily on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection. For example, cars are more likely to appear on roads than in offices, and wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

Paper Structure (18 sections, 4 figures, 5 tables)

This paper contains 18 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Association DETR overview. The input image is first fed into the backbone network. The shallowest image feature is denoted $S_1$, the second shallowest $S_2$, and the deepest $S_3$. The shallowest feature $S_1$ is fed into the Background Attention Module, which is designed to capture background information; Figure 4 visualizes some samples to illustrate what this module does. The features $S_1$, $S_2$, and $S_3$ are also fed into the Hybrid Encoder for both intra-feature and inter-feature enhancement. The output $F_b$ of the Background Attention Module is then fed into the Association Module, which performs feature enhancement related to background information. The output of the Association Module, $F_a$, is then added to $F_b$. Additionally, $F_b$ is added to $F_3$ to give $F_{\hat{3}}$, where $F_3$ is the Hybrid Encoder output corresponding to the input $S_3$. Finally, the features $F_1$, $F_2$, and $F_{\hat{3}}$ undergo query selection and are passed into the Decoder and Detection Head to predict object bounding boxes and classes.
  • Figure 2: Background Attention Module & Single RFCBAMConv Block. The left side shows the structure of the Background Attention Module; the right side shows the details of a single RFCBAMConv Block, which is located within the Background Attention Module.
  • Figure 3: Association Module. We incorporate ConvFFN and Window Attention to trade off performance against speed.
  • Figure 4: Visualization of Background Attention. BG refers to background. The figure was produced with pytorch-grad-cam. The BAM (Background Attention Module) effectively captures background information across various scenarios. In the first image, which features bears, it successfully identifies the grass behind them. In the second image, it accurately detects grass and even fencing, even though fencing is not among our training categories. In the third and fourth images, the module correctly identifies the road, sky, and grass.
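
The data flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal wiring sketch, not the authors' implementation: every function body is a hypothetical placeholder (simple average pooling or scaling) standing in for the real backbone, Background Attention Module, Hybrid Encoder, and Association Module. The strides of 8, 16, and 32 are assumptions, as is the choice to have the placeholder BAM downsample $S_1$ to the resolution of $S_3$ so that the additions are shape-compatible. The caption does not state how the sum of $F_a$ and $F_b$ is used afterwards; the sketch assumes that this sum becomes the updated $F_b$ that is then added to $F_3$. Query selection, the Decoder, and the Detection Head are omitted.

```python
import numpy as np

def avg_pool(x, k):
    """Downsample a (C, H, W) array by averaging non-overlapping k x k blocks."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def backbone(image):
    """Placeholder backbone returning S1 (shallowest) to S3 (deepest).
    The strides 8, 16, 32 are assumptions, not taken from the paper."""
    return avg_pool(image, 8), avg_pool(image, 16), avg_pool(image, 32)

def background_attention(s1):
    """Placeholder BAM. Assumed to bring S1 down to the resolution of S3
    so that its output F_b can later be added to F_3."""
    return avg_pool(s1, 4)

def hybrid_encoder(s1, s2, s3):
    """Placeholder Hybrid Encoder: identity, so F_i has the shape of S_i."""
    return s1, s2, s3

def association_module(f_b):
    """Placeholder Association Module: a shape-preserving transform of F_b."""
    return 0.5 * f_b

def forward(image):
    s1, s2, s3 = backbone(image)
    f_b = background_attention(s1)           # background branch starts from S1
    f1, f2, f3 = hybrid_encoder(s1, s2, s3)  # intra-/inter-feature enhancement
    f_a = association_module(f_b)
    f_b = f_a + f_b                          # assumed reading: the sum replaces F_b
    f3_hat = f3 + f_b                        # F_3-hat, the background-enriched deepest feature
    return f1, f2, f3_hat                    # would go on to query selection and the decoder

if __name__ == "__main__":
    f1, f2, f3_hat = forward(np.ones((3, 64, 64)))
    print(f1.shape, f2.shape, f3_hat.shape)
```

The main point of the sketch is the shape constraint: because $F_b$ originates from the highest-resolution feature but is added to the lowest-resolution encoder output, the background branch has to reduce resolution (and, in a real model, match channel counts) somewhere along the way.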