Tracking by Detection and Query: An Efficient End-to-End Framework for Multi-Object Tracking

Shukun Jia, Shiyu Hu, Yichao Cao, Feng Yang, Xin Lu, Xiaobo Lu

Abstract

Multi-object tracking (MOT) is primarily dominated by two paradigms: tracking-by-detection (TBD) and tracking-by-query (TBQ). While TBD offers modular efficiency, its fragmented association pipeline often limits robustness in complex scenarios. Conversely, TBQ enhances semantic modeling end-to-end but suffers from high training costs and slow inference due to the tight coupling of detection and association. In this work, we propose the tracking-by-detection-and-query framework, TBDQ-Net, to advance the synergy between TBD and TBQ paradigms. By integrating a frozen detector with a lightweight associator, this architecture ensures intrinsic efficiency. Within this streamlined framework, we introduce tailored designs to address MOT-specific challenges. Concretely, we alleviate task conflicts and occlusions through the dual-stream update of the Basic Information Interaction (BII) module. The Content-Position Alignment (CPA) module further refines both content and positional components, providing well-aligned representations for association decoding. Extensive evaluations on DanceTrack, SportsMOT, and MOT20 benchmarks demonstrate that TBDQ-Net achieves a favorable efficiency-accuracy trade-off in challenging scenarios. Specifically, TBDQ-Net outperforms leading TBD methods by 6.0 IDF1 points on DanceTrack and achieves the best performance among TBQ methods in the crowded MOT20 benchmark. Relative to MOTRv2, TBDQ-Net reduces trainable parameters by approximately 80% while accelerating practical inference by 37.5%. These results highlight TBDQ-Net as an efficient alternative to heavy architectures, showcasing the efficacy of lightweight design. Source code is publicly available at https://github.com/FaithFlow/TBDQ-Net.

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of mainstream paradigms. (a) Tracking-by-Detection (TBD): A decoupled but fragmented framework that typically relies on heuristic, non-differentiable matching, limiting association robustness in complex scenarios. (b) Tracking-by-Query (TBQ): An end-to-end trainable framework that enhances coherence but suffers from heavy computational costs due to the coupled detection-association architecture. (c) Tracking-by-Detection-and-Query (TBDQ-Net, ours): A unified framework integrating a frozen, pre-trained detector with a lightweight, learnable associator. It effectively synthesizes the advantages of both paradigms: achieving the high efficiency of decoupled architectures while maintaining the strong end-to-end tracking capability.
  • Figure 2: Architectural overview of TBDQ-Net. The framework follows an end-to-end query-based tracking paradigm, modeling target trajectories via track queries ($T$), while capturing emerging objects through detection queries ($D$). For each frame, a frozen detector extracts a raw candidate pool $\mathcal{D}$. Subsequently, the selection strategies $C_1$ and $C_2$ partition the raw candidates into qualified detection queries $D_q$ and noisy queries $N_q$ based on confidence scores $s(q)$ and the threshold value $\tau_q$. The internal mechanism of the associator (dashed box) primarily consists of two key components: (1) The Basic Information Interaction (BII) module. It facilitates dual-stream refinement: the BII-D stream filters detection queries by perceiving existing tracks to alleviate task conflicts, while the BII-T stream updates trajectories by integrating historical context ($H_q$) with the latest detection priors ($D$). (2) The Content-Position Alignment (CPA) module. It reconciles the semantic gap, which is primarily indicated by the temporally lagged $T_p^{-1}$, between query content ($Q_q$) and spatial embeddings ($Q_p$) using global image features $\mathcal{F}$ and positional encodings $\mathcal{P}$. To enhance training, hierarchical supervision is applied through auxiliary ($\hat{y}^{aux}$) and main ($\hat{y}^{main}$) predictions.
  • Figure 3: Visualization of attention weights in the BII dual-stream. In each case, the left part presents a tracking scenario. The dashed and solid bounding boxes represent current-frame detections and previous tracking results, respectively. Yellow highlights indicate noteworthy individuals. The right part illustrates the attention matrices. $W_d$ and $W_t$ visualize the attention weights between Query and Key components as defined in Equations \ref{biidet} and \ref{biitrack}. The matrix indices $(0, 1, ...)$ correspond to the object IDs shown in the left images.
  • Figure 4: Visualization of tracking results in diverse challenging scenarios. Yellow stars signify the noteworthy individuals in each frame. Object IDs are discernible upon magnified viewing.
  • Figure 5: Visualization of typical failure cases in MOT20. Top row: Ground truth annotations from the training set. Pedestrians #235 and #236 (yellow stars) are fully occluded during $t_2 \to t_3$ but are annotated via linear interpolation between their visible states at $t_1$ and $t_4$, contrasting with the stationary pedestrian #178. This annotation style implies an offline tracking paradigm. Bottom row: Tracking results from the test set. Pedestrian #84 (red star), undergoing similar occlusion, suffers an ID switch upon reappearance at $t_4$ (#111). In contrast, pedestrian #6 (green star), despite severe overlap ($t_2$) and occlusion ($t_3$), is successfully tracked, demonstrating the model's robustness under standard online conditions.
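
The Figure 5 caption notes that fully occluded pedestrians are annotated via linear interpolation between their visible states at $t_1$ and $t_4$. The following is a minimal sketch of that kind of interpolation, not the benchmark's actual annotation tooling. The function name `interpolate_boxes`, the `(x, y, w, h)` box layout, and the integer frame indices are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x, y, w, h) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(
    box_start: Box, box_end: Box, t_start: int, t_end: int
) -> List[Tuple[int, Box]]:
    """Fill the frames strictly between t_start and t_end by linearly
    interpolating each box coordinate between the two visible states."""
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    filled = []
    for t in range(t_start + 1, t_end):
        # alpha runs from 0 (at t_start) to 1 (at t_end).
        alpha = (t - t_start) / (t_end - t_start)
        box = tuple(a + alpha * (b - a) for a, b in zip(box_start, box_end))
        filled.append((t, box))
    return filled


# Hypothetical example: visible at frames 1 and 4, occluded at frames 2 and 3.
for frame, box in interpolate_boxes((0.0, 0.0, 10.0, 10.0), (30.0, 60.0, 10.0, 10.0), 1, 4):
    print(frame, box)
```

Because the interpolated positions at $t_2$ and $t_3$ depend on the later observation at $t_4$, an online tracker cannot reproduce them at inference time, which is consistent with the caption's remark that this annotation style implies an offline tracking paradigm.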