StageInteractor: Query-based Object Detector with Cross-stage Interaction

Yao Teng, Haisong Liu, Sheng Guo, Limin Wang

TL;DR

StageInteractor addresses the misalignment between supervision and predictions in multi-stage query-based detectors by introducing cross-stage interaction. It combines a cross-stage label assigner, which redistributes training targets across decoder layers based on query indices and IoU criteria, with cross-stage dynamic filter reuse, which cascades heavy dynamic operators across stages via lightweight adapters. Empirical results on MS COCO show substantial gains over prior methods, achieving 44.8 AP with a ResNet-50 backbone and 100 queries (12 epochs) and up to 52.7 AP with longer training and 300 queries on stronger backbones, marking a new state-of-the-art among query-based detectors. The approach delivers faster convergence and improved modeling capacity with modest computational overhead, offering practical benefits for scalable, high-accuracy object detection.

Abstract

Previous object detectors make predictions based on dense grid points or numerous preset anchors, and most of them are trained with one-to-many label assignment strategies. In contrast, recent query-based object detectors rely on a sparse set of learnable queries and a series of decoder layers, with one-to-one label assignment applied independently at each layer for deep supervision during training. Despite the great success of query-based object detection, this one-to-one label assignment strategy requires detectors to have strong fine-grained discrimination and modeling capacity. To address this, we propose a new query-based object detector with cross-stage interaction, coined StageInteractor. During forward propagation, we improve modeling ability efficiently by reusing dynamic operators with lightweight adapters. For label assignment, a cross-stage label assigner is applied after the one-to-one label assignment: the training target class labels are gathered across stages and then reallocated to proper predictions at each decoder layer. On the MS COCO benchmark, our model improves the baseline by 2.2 AP and achieves 44.8 AP with a ResNet-50 backbone, 100 queries, and 12 training epochs. With longer training and 300 queries, StageInteractor achieves 51.1 AP and 52.2 AP with ResNeXt-101-DCN and Swin-S, respectively.
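
The gather-then-reallocate idea behind the cross-stage label assigner can be illustrated with a small sketch. This is not the authors' implementation: the function names, the data layout, the 0.5 IoU threshold, and the choice to keep each stage's own one-to-one match are all assumptions made for illustration. The per-stage bipartite matching is taken as given input rather than computed.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def cross_stage_assign(
    stage_boxes: List[List[Box]],         # stage_boxes[s][q]: box of query q at stage s
    stage_matches: List[Dict[int, int]],  # stage_matches[s][q]: matched ground-truth index
    gt_boxes: List[Box],
    gt_labels: List[str],
    iou_thresh: float = 0.5,              # hypothetical threshold, not from the paper
) -> List[Dict[int, Set[str]]]:
    """Return, per stage, the class labels assigned to each positive query."""
    # Gather: for each query index, every ground truth it was matched to at any stage.
    gathered: Dict[int, Set[int]] = {}
    for matches in stage_matches:
        for q, g in matches.items():
            gathered.setdefault(q, set()).add(g)

    targets: List[Dict[int, Set[str]]] = []
    for s, boxes in enumerate(stage_boxes):
        stage_targets: Dict[int, Set[str]] = {}
        # Assumption: the stage's own one-to-one result is always kept.
        for q, g in stage_matches[s].items():
            stage_targets.setdefault(q, set()).add(gt_labels[g])
        # Reallocate: a gathered label is used only if the box at this stage
        # overlaps the ground truth enough.
        for q, gts in gathered.items():
            for g in gts:
                if iou(boxes[q], gt_boxes[g]) >= iou_thresh:
                    stage_targets.setdefault(q, set()).add(gt_labels[g])
        targets.append(stage_targets)
    return targets


# Toy setup: one ground-truth person, two queries, two stages.
# Query 0 is matched at the first stage only, query 1 at the last stage only.
gt_boxes = [(10.0, 10.0, 50.0, 90.0)]
gt_labels = ["person"]
stage_boxes = [
    [(12.0, 12.0, 52.0, 88.0), (8.0, 10.0, 48.0, 92.0)],    # first stage
    [(30.0, 30.0, 90.0, 100.0), (10.0, 11.0, 50.0, 90.0)],  # last stage
]
stage_matches = [{0: 0}, {1: 0}]
targets = cross_stage_assign(stage_boxes, stage_matches, gt_boxes, gt_labels)
```

In this toy setup, query 1 is a negative at the first stage under plain one-to-one matching, but it becomes a positive there because the same query index is matched to the person at the last stage and its first-stage box overlaps the ground truth well. Query 0 stays a negative at the last stage because its box there has drifted away from the object.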

Paper Structure (22 sections, 4 equations, 9 figures, 20 tables)

Figures (9)

  • Figure 1: Convergence curves of our model and other query-based object detectors (DETR, Deformable DETR, Sparse R-CNN, AdaMixer) with ResNet-50 on the MS COCO minival set.
  • Figure 2: The results of label assignment at various stages. The green box denotes the ground-truth object Person. The red and white boxes denote object predictions derived from two different queries. Pos and Neg denote the positive sample and the negative sample, respectively. (a) At the first stage, bipartite matching assigns the ground-truth object Person to the white box but not to the red box; at the sixth stage, the opposite is true. (b) With our cross-stage label assigner, the red box in the first stage can also be assigned the ground-truth Person.
  • Figure 3: Overview. The cross-stage interaction consists of two parts: cross-stage label assignment and cross-stage dynamic filter reuse. During forward propagation, the dynamic filters of each decoder layer are reused in the subsequent stages, i.e., we stack them with specific lightweight adapters to increase the depth of each decoder layer. For label assignment, our cross-stage label assigner gathers the results of bipartite matching across stages and then selects proper target labels as supervision.
  • Figure 4: Overview of AdaMixer.
  • Figure 5: The process of our cross-stage label assignment.
  • ...and 4 more figures