Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen

TL;DR

Salience DETR tackles redundancy in two-stage DETR-like detectors by introducing hierarchical salience filtering guided by scale-independent supervision, enabling transformer encoding to focus on discriminative queries. It couples this with background embedding, cross-level token fusion, and redundancy removal to address semantic misalignment and unstable initialization across scales and layers. The method achieves substantial accuracy gains across task-specific datasets (ESD, CSD, MSSD) and COCO 2017 while reducing computation, demonstrating improved efficiency without sacrificing performance. This approach offers a practical path to robust, scalable DETR-like detectors suitable for real-world, small-object and low-contrast scenarios.

Abstract

DETR-like methods have significantly increased detection performance in an end-to-end manner. Mainstream two-stage frameworks perform dense self-attention and select a fraction of queries for sparse cross-attention, which has proven effective for improving performance but also introduces a heavy computational burden and a high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries, for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
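
To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a salience target that falls linearly from 1 at the object center to 0 at the border, with the distance normalized by the box size, and a plain top-k filter that keeps a fixed ratio of queries. The function names `salience_target` and `filter_queries`, the `(cx, cy, w, h)` box layout, and the exact distance formula are all hypothetical choices made for illustration; the paper defines its own supervision and filtering ratios.

```python
def salience_target(px, py, box):
    """Hypothetical scale-independent salience of a point w.r.t. one box.

    box is (cx, cy, w, h). Offsets from the center are divided by the box
    half-extents, so the value depends only on the relative position inside
    the box, not on the absolute object size.
    """
    cx, cy, w, h = box
    dx = abs(px - cx) / (w / 2.0)
    dy = abs(py - cy) / (h / 2.0)
    if dx > 1.0 or dy > 1.0:
        return 0.0  # outside the box: treated as background
    # Assumed form: 1 at the center, decreasing linearly to 0 on the border.
    return 1.0 - max(dx, dy)


def filter_queries(scores, keep_ratio):
    """Return the indices of the highest-scoring queries, in original order.

    Only these queries would be passed on to the encoder layer; the rest
    are skipped. At least one query is always kept.
    """
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# A small and a large object give the same target at the same relative offset.
small_box = (10.0, 10.0, 4.0, 4.0)
large_box = (50.0, 50.0, 40.0, 40.0)
same_relative = (
    salience_target(11.0, 10.0, small_box),   # halfway to the border
    salience_target(60.0, 50.0, large_box),   # halfway to the border
)

kept = filter_queries([0.1, 0.9, 0.5, 0.0], keep_ratio=0.5)
```

Under these assumptions, a point halfway between the center and the border receives the same target for both boxes, which is the property that lets small objects compete with large ones during query selection. A discrete foreground/background label would instead give large objects many more positive positions than small ones.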

Paper Structure (17 sections, 10 equations, 7 figures, 10 tables)

This paper contains 17 sections, 10 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Visualization of selected queries in two-stage initialization. Queries and object annotations are denoted by points and bounding boxes, respectively. The selection results illustrate scale bias and redundancy despite one-to-one Hungarian matching.
  • Figure 2: The architecture overview of Salience DETR. We design hierarchical query filtering to select layer-wise and level-wise queries (section "Hierarchical query filtering") under salience-guided supervision (section "Salience-guided supervision") to mitigate the scale bias shown in Figure 1. The semantic misalignment among queries is mitigated by query refinement modules (section "Query refinement").
  • Figure 3: Qualitative illustration of scale-independent supervision (top) and discrete foreground-background supervision (bottom). With salience decreasing from the object center to the border, scale-independent supervision balances selected queries even for small objects.
  • Figure 4: Cross-level token fusion
  • Figure 5: Convergence of Salience DETR
  • ...and 2 more figures