MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

Zhixiong Nan, Xianghong Li, Jifeng Dai, Tao Xiang

TL;DR

The paper tackles limited feature utilization in DETR-like detectors caused by cascaded decoders. It introduces MI-DETR, a parallel Multi-time Inquiries framework with U-like Feature Interaction to let object queries learn multiple patterns from multi-layer image features. Across COCO experiments with ResNet-50 and Swin-L backbones, MI-DETR achieves state-of-the-art gains over representative DETR-like models, including +0.7 AP (12 epochs) and +0.6 AP (24 epochs) over Relation-DETR, and notable improvements on challenging scenes. Diagnostic and visualization studies validate the approach's effectiveness, interpretability, and plug-in simplicity for enhancing transformer-based object detection.

Abstract

Based on an analysis of the cascaded decoder architecture commonly adopted in existing DETR-like models, this paper proposes a new decoder architecture. The cascaded decoder architecture constrains object queries to update only in the cascaded direction, so the queries learn relatively limited information from image features. However, object detection in natural scenes is challenging (e.g., objects may be extremely small, heavily occluded, or confusingly mixed with the background) and requires a model to fully utilize image features, which motivates us to propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism. MI enables object queries to learn more comprehensive information, and our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark under different backbones and training epochs, achieving +2.3 AP and +0.6 AP improvements over the most representative model, DINO, and the SOTA model, Relation-DETR, respectively, under the ResNet-50 backbone. In addition, a series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.

Paper Structure

This paper contains 24 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (a) We propose the parallel multi-time inquiries mechanism with parameters-dependent inquiry heads and a fusion operation in a decoder layer; (b) In recently proposed DETR-like models, primary and auxiliary queries are concatenated and fed to the same inquiry head in a decoder layer, so this kind of parallel architecture is parameter-sharing parallel.
  • Figure 2: The overview of MI-DETR. The main novelty is that MI-DETR uses Multi-time Inquiries (MI) decoder layers to replace the traditional decoder layers adopted in previous DETR-like models. The backbone and the $L$-layer transformer encoder extract the image features $\bm{E}=\{\bm{E}_0, \ldots, \bm{E}_L\}$. For the $i$-th MI decoder layer, the input is object queries $\bm{Q}_{i-1}$ and the output is $\bm{Q}_i$. In the inquiry heads of the $i$-th MI decoder layer, object queries learn multi-pattern information by interacting with $\bm{E}_j$, where $j=L-i+1$, and $\bm{E}_j$ is the corresponding image features after the processing of U-like Feature Interaction. The output of the last MI decoder layer (i.e., $\bm{Q}_L$) is used to predict the locations and categories of objects.
  • Figure 3: The architecture of Lite-MI.
  • Figure 4: The visualization of object queries in different inquiry heads, produced with the t-SNE high-dimensional data visualization tool. More results for the query, head, and detection-result visualizations can be found in the supplementary material.
  • Figure 5: Object detection results based on the single inquiry head and multiple inquiry heads.
  • ...and 4 more figures
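
The data flow described in the captions of Figures 1 and 2 can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the function names (`feature_index`, `inquiry_head`, `mi_decoder_layer`, `run_decoder`) are hypothetical, each inquiry head is reduced to single-head dot-product cross-attention with its own projection matrix, the fusion operation is assumed to be a plain average with a residual connection, and the U-like Feature Interaction is omitted entirely. The only part taken directly from the source is the index rule $j=L-i+1$.

```python
import math


def feature_index(i, num_layers):
    """Index j of the encoder features E_j read by the i-th decoder layer (1-based)."""
    if not 1 <= i <= num_layers:
        raise ValueError("decoder layer index must be in 1..L")
    return num_layers - i + 1  # j = L - i + 1, as stated in the Figure 2 caption


def linear(x, w):
    """Multiply a row vector x (length d) by a d x d matrix w."""
    return [sum(x[k] * w[k][c] for k in range(len(x))) for c in range(len(w[0]))]


def inquiry_head(queries, features, w_q):
    """One inquiry: toy cross-attention of queries over features (assumed form)."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        pq = linear(q, w_q)  # head-specific query projection
        scores = [sum(a * b for a, b in zip(pq, f)) / math.sqrt(d) for f in features]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w / total * f[c] for w, f in zip(weights, features)) for c in range(d)]
        )
    return outputs


def mi_decoder_layer(queries, features, head_weights):
    """Send the same queries to every head, then fuse (assumed: average + residual)."""
    head_outputs = [inquiry_head(queries, features, w) for w in head_weights]
    n = len(head_outputs)
    return [
        [q[c] + sum(h[r][c] for h in head_outputs) / n for c in range(len(q))]
        for r, q in enumerate(queries)
    ]


def run_decoder(queries, encoder_features, head_weights_per_layer):
    """encoder_features is [E_0, ..., E_L]; decoder layer i reads E_{L-i+1}."""
    num_layers = len(encoder_features) - 1
    for i in range(1, num_layers + 1):
        j = feature_index(i, num_layers)
        queries = mi_decoder_layer(
            queries, encoder_features[j], head_weights_per_layer[i - 1]
        )
    return queries  # plays the role of Q_L, the input to the prediction heads
```

Under the index rule, the first decoder layer reads the deepest encoder output $\bm{E}_L$ and the last decoder layer reads $\bm{E}_1$, so the decoder walks the encoder features in reverse order. The contrast with the parameter-sharing design in Figure 1(b) is that here every head receives the same queries but applies its own weights before the outputs are fused.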