Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection

Shoumeng Qiu, Xinrun Li, Yang Long

TL;DR

A novel matching-free training scheme for DETR-based detectors that replaces explicit heuristic matching with differentiable correspondence learning, removing the discrete matching bottleneck while outperforming existing state-of-the-art methods.

Abstract

Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, their reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of performing a discrete assignment, we use encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship in turn provides supervision signals for learning the queries. Experimental results demonstrate that our method bypasses the traditional matching process and significantly improves training efficiency: it reduces matching latency by over 50%, removes the discrete matching bottleneck through differentiable correspondence learning, and outperforms existing state-of-the-art methods.
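The probing step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head scaled dot-product attention, and the plain L1 box distance used as the pairwise cost are all assumptions made for illustration, and the ground-truth and query embeddings are random stand-ins for learned encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def gt_probe_weights(gt_emb, query_emb):
    """Dense correspondence: GT embeddings probe the decoder queries.

    gt_emb:    (num_gt, d) encoded ground truths, used as attention queries
    query_emb: (num_queries, d) encoded decoder queries, used as keys
    Returns a (num_gt, num_queries) matrix whose rows sum to 1.
    """
    d = gt_emb.shape[-1]
    scores = gt_emb @ query_emb.T / np.sqrt(d)
    return softmax(scores, axis=-1)

def broadcast_cost(pred_boxes, gt_boxes):
    """Pairwise L1 distance between every GT box and every predicted box.

    The L1 box distance is an assumed stand-in for the paper's cost.
    """
    return np.abs(gt_boxes[:, None, :] - pred_boxes[None, :, :]).sum(axis=-1)

def weighted_error(weights, cost):
    """Attention-weighted cost, averaged over ground truths."""
    return (weights * cost).sum(axis=1).mean()

# Toy example: 2 ground truths, 5 decoder queries, 8-dim embeddings.
rng = np.random.default_rng(0)
gt_emb = rng.normal(size=(2, 8))
query_emb = rng.normal(size=(5, 8))
gt_boxes = rng.uniform(size=(2, 4))
pred_boxes = rng.uniform(size=(5, 4))

weights = gt_probe_weights(gt_emb, query_emb)
loss = weighted_error(weights, broadcast_cost(pred_boxes, gt_boxes))
```

Because every step is a smooth function of the embeddings and boxes, such a loss can in principle be minimized by gradient descent without a discrete assignment step, which is the property the abstract emphasizes.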

Paper Structure (15 sections, 11 equations, 3 figures, 4 tables)

This paper contains 15 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overall architecture of the proposed Match-Free training paradigm. Ground-truth (GT) entities and predicted object queries are first fed into the GT-Probe Module to compute a dense correspondence matrix. The Sparse Correspondence Generation module then refines this matrix into a sparsified assignment topology. In parallel, a Broadcast Cost matrix is constructed from the pairwise distances between all queries and GTs. Finally, the broadcast cost is modulated by the learned dense and sparse correspondences to produce the correspondence weight loss ($\mathcal{L}_w$) and the query loss ($\mathcal{L}_q$), respectively.
  • Figure 2: Detailed architecture of the GT-Probe Module. The module facilitates query-target alignment by encoding ground-truth (GT) entities and predicted object queries via MLP layers. A cross-attention mechanism is employed where the GT embeddings function as the queries (q) to probe the predicted query bank, which serves as the keys (k) and values (v). This process yields a dense correspondence weight matrix.
  • Figure 3: Internal workflow of the Sparse Correspondence Generation (SCG) module. The module refines dense correspondence weights into a sparse assignment topology through a multi-stage filtering process. It sequentially applies row-wise maximum filtering and column-wise maximum filtering to identify local and global saliency. A dynamic thresholding step (the $>/<$ comparison), controlled by the sparsity coefficient $\rho$, is then applied, followed by row-wise normalization to produce the final sparse, stable supervision weights for query refinement.
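The sparsification stages named in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the paper's exact procedure: the function name, the rule that an entry survives only if it is at least $\rho$ times its row maximum and is also the maximum of its column, and the `eps` guard are assumptions introduced for illustration.

```python
import numpy as np

def sparse_correspondence(dense_w, rho=0.5, eps=1e-12):
    """Turn dense (num_gt, num_queries) weights into sparse supervision weights.

    Assumed rule: keep entry (i, j) when it is at least rho times the largest
    weight in row i (per-GT saliency) and is the largest weight in column j
    (each query is claimed by at most one GT, barring ties).
    """
    row_max = dense_w.max(axis=1, keepdims=True)  # strongest query per GT
    col_max = dense_w.max(axis=0, keepdims=True)  # strongest GT per query
    keep = (dense_w >= rho * row_max) & (dense_w >= col_max)
    sparse = np.where(keep, dense_w, 0.0)
    # Row-wise normalization; a row with no surviving entry stays all zero.
    row_sum = sparse.sum(axis=1, keepdims=True)
    return sparse / np.maximum(row_sum, eps)

# Toy example: 2 ground truths, 3 queries.
dense = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.4, 0.5]])
loose = sparse_correspondence(dense, rho=0.5)   # second GT keeps two queries
strict = sparse_correspondence(dense, rho=0.9)  # second GT keeps one query
```

Under this reading, a larger $\rho$ yields a sparser assignment: with `rho=0.5` the second ground truth supervises two queries, while with `rho=0.9` it supervises only its strongest one.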