SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors

Fabian Gülhan, Emil Mededovic, Yuli Wu, Johannes Stegmaier

TL;DR

SelfMOTR addresses the detection–association conflict in end-to-end transformer MOT by extracting the model's own detection signal and reusing it as internally generated priors. It adds a detection-only forward pass that produces self proposals, which are then fed together with track queries into a shared decoder, keeping the pipeline detector-free and end-to-end. This decoupling stabilizes detection and strengthens association, achieving competitive results on DanceTrack (e.g., $\text{HOTA}=69.2$, $\text{IDF1}=72.5$) and MOT17 while avoiding external detectors. The findings suggest that the internal capacity of end-to-end transformers can be allocated to detection and association more effectively, yielding improvements without added detector modules and motivating integration into stronger query-based trackers.

Abstract

Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.

Paper Structure

This paper contains 13 sections, 8 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Overview of SelfMOTR, which exposes MOTR’s hidden detection capability and reuses its own detections as internal priors to drive a simple, detector-free end-to-end multi-object tracking pipeline.
  • Figure 2: We train MOTR with the standard tracking setup on DanceTrack (Sun et al., 2022). During evaluation, we run two inference modes on the same checkpoints: (i) normal MOTR with track queries and (ii) MOTR with track queries removed.
  • Figure 3: Four ways of injecting detection priors into MOTR. (a) Detection Pretraining: MOTR is first trained as a pure detector on the target dataset with track queries disabled, and then fine-tuned in the standard tracking setting, transferring the entire detection model. (b) Query Pretraining: only the detection queries from the detection-pretrained MOTR are reused to initialize the tracking model, so that the backbone and decoder are learned from scratch while the query embeddings already encode a detection-aware prior. (c) Distillation: a frozen detection-pretrained MOTR acts as a teacher that generates additional box labels; during MOT training, a student MOTR with both detect and track queries is optimized jointly with the usual MOT loss and a Hungarian-matched detection distillation loss that enforces consistency with the teacher’s predictions. (d) Anchor Proposal: instead of learning decoder anchors from scratch, bounding boxes from the detection-pretrained MOTR are converted into 4D anchor boxes and refined in a lightweight decoder pass.
  • Figure 4: Overview of SelfMOTR. The figure shows how, at each frame, we first use MOTR’s detection branch to produce boxes and scores, convert the confident ones into self proposal queries (4D anchor + confidence-conditioned content), and then concatenate these proposals with the track queries so that the shared decoder can refine and associate them in the tracking pass.
  • Figure 5: Effect of decoder depth on accuracy. We vary the number of decoder layers used in the detection-only pass and report tracking accuracy (HOTA) on the validation and test sets, as well as detection accuracy (mAP) on the validation set.
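
The per-frame flow described in the Figure 4 caption (detection pass, confidence filtering, self proposal queries, concatenation with track queries) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary layout, the 0.5 threshold, the sinusoidal form of the confidence-conditioned content, and the concatenation order are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), assumed normalized to [0, 1]
Query = Dict[str, object]


def confidence_embedding(score: float, dim: int = 8) -> List[float]:
    """Encode a detection score as a content vector.

    Assumption: a sinusoidal encoding stands in for whatever
    confidence-conditioned content the model actually uses.
    `dim` is assumed to be even.
    """
    emb: List[float] = []
    for i in range(dim // 2):
        freq = (2.0 ** i) * math.pi
        emb.append(math.sin(freq * score))
        emb.append(math.cos(freq * score))
    return emb


def build_self_proposals(
    boxes: Sequence[Box],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
    dim: int = 8,
) -> List[Query]:
    """Turn confident detections from the detection-only pass into proposal queries."""
    proposals: List[Query] = []
    for box, score in zip(boxes, scores):
        if score >= threshold:
            proposals.append(
                {
                    "anchor": tuple(box),  # 4D anchor box
                    "content": confidence_embedding(score, dim),
                    "kind": "proposal",
                }
            )
    return proposals


def decoder_inputs(track_queries: Sequence[Query], proposals: Sequence[Query]) -> List[Query]:
    """Concatenate track queries and self proposals for the shared decoder.

    Assumption: track queries come first; the caption does not specify the order.
    """
    return list(track_queries) + list(proposals)
```

In this sketch the shared decoder itself is left out; only the construction of its input queries is shown. Keeping the anchor and the content as separate fields mirrors the caption's description of a 4D anchor paired with confidence-conditioned content.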