Hausdorff Distance Matching with Adaptive Query Denoising for Rotated Detection Transformer

Hakjin Lee, Minki Song, Jamyoung Koo, Junghoon Seo

TL;DR

This work proposes a Hausdorff distance-based cost for bipartite matching and an adaptive query denoising method that uses bipartite matching to selectively eliminate noised queries that detract from model improvement. Compared to other models adopting a ResNet-50 backbone, the proposed model improves AP$_{50}$ by +4.18, +4.59, and +4.99 on DOTA-v2.0, DOTA-v1.5, and DIOR-R, respectively.

Abstract

Detection Transformers (DETR) have recently set new benchmarks in object detection. However, their performance in detecting rotated objects lags behind established oriented object detectors. Our analysis identifies a key observation: the boundary discontinuity and square-like problems in bipartite matching make it difficult to assign appropriate ground truths to predictions, leading to duplicate low-confidence predictions. To address this, we introduce a Hausdorff distance-based cost for bipartite matching, which more accurately quantifies the discrepancy between predictions and ground truths. Additionally, we find that a static denoising approach impedes the training of rotated DETR, especially as the quality of the detector's predictions begins to exceed that of the noised ground truths. To overcome this, we propose an adaptive query denoising method that employs bipartite matching to selectively eliminate noised queries that detract from model improvement. When compared to models adopting a ResNet-50 backbone, our proposed model yields remarkable improvements, achieving $\textbf{+4.18}$ AP$_{50}$, $\textbf{+4.59}$ AP$_{50}$, and $\textbf{+4.99}$ AP$_{50}$ on DOTA-v2.0, DOTA-v1.5, and DIOR-R, respectively.
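To make the idea of a Hausdorff distance-based cost concrete, the following is a minimal sketch, not the authors' implementation. It assumes boxes are given as `(cx, cy, w, h, theta)` with `theta` in radians, and it approximates the Hausdorff distance by sampling points along each box boundary. The function names and the `points_per_edge` parameter are hypothetical and chosen only for illustration.

```python
import math


def rotated_box_corners(cx, cy, w, h, theta):
    """Return the four corners of a rotated box (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def sample_boundary(corners, points_per_edge=8):
    """Sample evenly spaced points along each edge of a polygon."""
    pts = []
    n = len(corners)
    for i in range(n):
        (x0, y0), (x1, y1) = corners[i], corners[(i + 1) % n]
        for k in range(points_per_edge):
            t = k / points_per_edge
            pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return pts


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(box_a, box_b, points_per_edge=8):
    """Approximate symmetric Hausdorff distance between two rotated boxes."""
    a = sample_boundary(rotated_box_corners(*box_a), points_per_edge)
    b = sample_boundary(rotated_box_corners(*box_b), points_per_edge)
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# The same rectangle written with two different parameterizations:
same_a = (0.0, 0.0, 4.0, 2.0, 0.0)
same_b = (0.0, 0.0, 2.0, 4.0, math.pi / 2)
print(hausdorff_distance(same_a, same_b))  # close to zero
```

In this example the two parameter tuples describe the same rectangle, so a cost computed directly on `(w, h, theta)` would be large while the geometric distance is approximately zero. This is one way to see why a distance defined on the box shape itself can sidestep the boundary discontinuity and square-like problems mentioned in the abstract.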

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Challenges in our rotated detection transformer. Left: The Hausdorff distance addresses the duplicate low-confidence predictions resulting from the square-like problem. Right: The proposed adaptive query denoising employs bipartite matching between positive noised queries and accurate predictions to selectively filter queries for denoising.
  • Figure 2: Matching areas of Prediction A to the ground truth. The blue area indicates where the orange box is matched to the ground truth over the green box as the center of the orange box moves along a coordinate axis. In each case, both the ground truth and the green box are fixed. Left: Using the L1 cost, the orange box, which is too far from the ground truth, is matched to it over the green box. Right: When using the KLD cost, the matching ignores the axis alignment between the two boxes. Conversely, the Hausdorff cost takes this into account.
  • Figure 3: Left: Contrastive query denoising where noised queries and ground truths are directly matched, leading to potential misclassifications. Right: Adaptive query denoising where bipartite matching selectively filters out noised queries, improving the accuracy of predictions as training progresses.
  • Figure 4: Adaptive query denoising filters out unhelpful noised queries as accuracy improves.
  • Figure 5: Comparison of attention layers in different models. (a) ARS-DETR, based on Deformable DETR, predicts reference boxes from learnable queries but does not fully integrate angle information into the objects' spatial information. (b) DAB-DETR, which our model builds upon, defines learnable queries as horizontal anchors. (c) Our model extends this approach by defining anchors as rotated boxes, leveraging methods such as iterative refinement and two-stage proposals from DINO, resulting in improved integration of angle information.
  • ...and 4 more figures
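
The adaptive query denoising described in the abstract and in Figures 1, 3, and 4 could be sketched as follows. This is one possible reading of that description, not the authors' implementation, and the exact formulation in the paper may differ. It assumes that noised query `i` was generated from ground truth `i`, that matching costs are already computed, and that a noised query is kept only if bipartite matching over noised queries and regular predictions still assigns it to its own ground truth. The function names are hypothetical, and the brute-force matcher stands in for the Hungarian algorithm and is only suitable for tiny inputs.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Brute-force bipartite matching.

    cost[i][j] is the cost between ground truth i and candidate j.
    Returns the candidate index assigned to each ground truth.
    """
    n_gt, n_cand = len(cost), len(cost[0])
    best = min(
        permutations(range(n_cand), n_gt),
        key=lambda cols: sum(cost[i][j] for i, j in enumerate(cols)),
    )
    return list(best)


def select_noised_queries(cost_noised, cost_pred):
    """Return a keep-mask over noised queries.

    cost_noised[i][k]: cost between ground truth i and noised query k
                       (assumption: noised query k comes from ground truth k).
    cost_pred[i][j]:   cost between ground truth i and regular prediction j.
    """
    n_gt = len(cost_noised)
    # Candidates are the noised queries followed by the regular predictions.
    cost = [cost_noised[i] + cost_pred[i] for i in range(n_gt)]
    assigned = min_cost_assignment(cost)
    # Keep noised query i only if ground truth i is still matched to it.
    return [assigned[i] == i for i in range(n_gt)]


cost_noised = [[0.2, 5.0], [5.0, 0.9]]

# Early in training: regular predictions are poor, so all noised queries stay.
early_pred = [[3.0, 5.0], [5.0, 4.0]]
print(select_noised_queries(cost_noised, early_pred))

# Later: a prediction beats the second noised query, which is filtered out.
late_pred = [[0.6, 5.0], [5.0, 0.3]]
print(select_noised_queries(cost_noised, late_pred))
```

The design choice illustrated here is that filtering is driven by the matching itself rather than by a fixed schedule, so the number of retained noised queries shrinks as the detector's own predictions become more accurate than the noised ground truths.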