MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao

TL;DR

This paper identifies the root cause of poor detection in end-to-end MOTR as an unfair label assignment between detect and track queries and introduces Release-Fetch Supervision (RFS) to balance supervision without extra detectors. Complementary strategies, Pseudo Label Distillation (PLD) and Track Group Denoising (TGD), further strengthen detection and association, respectively. The resulting MOTRv3 substantially improves end-to-end MOT on DanceTrack and MOT17, outperforming MOTR by large margins and matching or exceeding MOTRv2 without requiring an auxiliary detector during inference. Together, these strategies demonstrate a practical path to robust, fully end-to-end multi-object tracking with improved convergence and stability.

Abstract

Although end-to-end multi-object trackers like MOTR enjoy the merit of simplicity, they suffer seriously from the conflict between detection and association, which results in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it requires an additional detection network. In this work, we are the first to reveal that this conflict arises from the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using the proposed release-fetch supervision strategy. In this strategy, labels are first released for detection and then gradually fetched back for association. In addition, two further strategies, pseudo label distillation and track group denoising, are designed to improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks, e.g., MOT17 and DanceTrack.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Comparison among MOTR, MOTRv2, and MOTRv3 (ours). The differences of MOTRv2 and MOTRv3 from MOTR are marked in reddish brown. Locked GTs are the labels assigned to track queries, and free GTs are the ones used to train detect queries.
  • Figure 2: Panels (a) and (b) show the number of activations of different detect queries with and without the proposed RFS strategy during training. Panels (c) and (d) illustrate the dynamics of the percentages of 2D box labels assigned to the detection and association parts, with and without RFS.
  • Figure 3: Overview of the MOTRv3 training pipeline. We primarily illustrate the three proposed strategies (RFS, PLD and TGD) in this figure.
  • Figure 4: Model scaling up. Res50, Conv-T, Conv-S and Conv-B denote ResNet-50, ConvNext-Tiny, ConvNext-Small and ConvNext-Base, respectively.
  • Figure 5: Illustration of the TGD strategy. Only one decoder layer is illustrated as an example; the other decoder layers follow the same procedure. First, the original track queries are expanded into $K$ track query groups. Subsequently, the decoder takes in all these query groups to perform one-to-one matching. Besides the track queries, the reference boxes of the queries are also expanded into $K$ groups, and random noise is added to these reference boxes. To prevent information leakage between the original track queries and the expanded track query groups, an attention mask is applied.
  • ...and 1 more figure
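
The Figure 5 caption describes three steps of track group denoising: copying the track queries into $K$ groups, adding random noise to the copied reference boxes, and masking attention between groups. The following is a minimal sketch of those steps in plain Python, not the authors' implementation. The function names, the (cx, cy, w, h) box format, the uniform size-scaled noise, the `noise_scale` value, and the choice to block attention between every pair of groups (not only between the original queries and the copies) are all assumptions made for illustration.

```python
import random


def expand_track_groups(track_queries, ref_boxes, k, noise_scale=0.1, seed=0):
    """Keep the N original track queries and append k noisy copies of them.

    Boxes are (cx, cy, w, h) tuples. The box format and the uniform,
    size-scaled noise are assumptions; the caption only says "random noise".
    """
    rng = random.Random(seed)
    queries = list(track_queries)             # group 0: original queries
    boxes = [tuple(b) for b in ref_boxes]     # group 0: clean reference boxes
    for _ in range(k):                        # groups 1..k: expanded copies
        for query, (cx, cy, w, h) in zip(track_queries, ref_boxes):
            queries.append(query)
            boxes.append((
                cx + rng.uniform(-1, 1) * noise_scale * w,
                cy + rng.uniform(-1, 1) * noise_scale * h,
                w * (1 + rng.uniform(-1, 1) * noise_scale),
                h * (1 + rng.uniform(-1, 1) * noise_scale),
            ))
    return queries, boxes


def group_attention_mask(n, k):
    """Return a square mask where mask[i][j] is True if i must NOT attend to j.

    Queries are laid out group by group (n per group, k + 1 groups), so two
    queries are in the same group exactly when i // n == j // n.
    """
    total = n * (k + 1)
    return [[(i // n) != (j // n) for j in range(total)] for i in range(total)]


if __name__ == "__main__":
    queries, boxes = expand_track_groups(
        ["track_a", "track_b"],
        [(0.5, 0.5, 0.2, 0.4), (0.3, 0.3, 0.1, 0.1)],
        k=3,
    )
    mask = group_attention_mask(n=2, k=3)
    print(len(queries), len(boxes), len(mask))
```

With two track queries and $K = 3$, the sketch yields eight queries in four groups of two. Each query may attend only to the other query in its own group, which is one simple way to keep information from leaking between the clean queries and their noisy copies.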