COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

Zekun Qian, Wei Feng, Ruize Han, Junhui Hou

Abstract

Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework that handles detection and association synergistically. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections by letting high-confidence tracked objects boost low-confidence candidates across frames, which stabilizes trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance: novel TETA reaches 35.4% and 30.5% on the validation and test sets, and novel AssocA and novel LocA improve by 4.8% and 5.8% over previous methods. The method also shows strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.

Paper Structure (21 sections, 19 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Overview of OVMOT challenges and our solutions. (a) Continuous Data Foundation: TAO's sparse annotations (every 30 frames) miss critical intermediate states. C-TAO provides continuous frame-by-frame annotations, enabling smooth trajectory learning. (b) Synergistic Framework: COVTrack++ operates through two complementary stages that mutually reinforce each other. The Association Feature Enhancement Stage constructs robust association features through ➀ Multi-Cue Adaptive Fusion (MCF) and ➁ Multi-Granularity Hierarchical Aggregation (MGA). The Detection Quality Improvement Stage recovers flickering detections via ➂ Temporal Confidence Propagation (TCP). The framework uses reliable temporal propagation features from the association stage to improve the detection results; the recovered detections, with boosted confidence, then serve as high-quality tracking sources for association in subsequent frames. This creates a bidirectional reciprocal mechanism that progressively improves both tracking continuity and detection quality.
  • Figure 2: Visualization of annotation statistics comparison between our dataset (C-TAO) and TAO. Left: total number of annotated frames and bounding boxes. Middle: average statistics per video and per track. Right: continuity statistics between consecutive annotated frames.
  • Figure 3: Annotation examples in challenging scenarios. Solid boxes represent original TAO annotations (30-frame intervals), while dashed boxes show our continuous annotations. (a) Progressive occlusion process of a pedestrian behind a car. (b) Continuous viewpoint transition of a vehicle under camera motion. (c) Progressive appearance evolution during a bird's pose transformation. C-TAO annotations capture crucial intermediate states that are missed in the original sparse annotations.
  • Figure 4: Overall framework of our method, containing three complementary modules. (a) Multi-Cue Adaptive Fusion (MCF) dynamically weights appearance, location, and semantic features based on intra-frame and inter-frame confidence. (b) Multi-Granularity Hierarchical Aggregation (MGA) enhances parent objects (e.g., trolley) by aggregating spatial and semantic information from child parts (e.g., luggage, wheels) through cross-attention. (c) Temporal Confidence Propagation (TCP) recovers low-confidence candidates in frame $t$ by leveraging high-confidence sources from frame $t-1$ via bipartite graph matching. $\bigotimes$ and $\bigoplus$ represent scaling multiplication and concatenation operations, respectively.
  • Figure 5: Illustration of detection flickering causing trajectory discontinuity. Green boxes indicate successful detections with high confidence, while red dashed boxes show missed detections due to confidence fluctuations. Despite minimal visual changes between consecutive frames, confidence score variations lead to detection flickering and fragmented trajectories.
  • ...and 4 more figures
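
The idea behind TCP in Figure 4(c) and Figure 5 (high-confidence sources from frame $t-1$ rescuing low-confidence candidates in frame $t$) can be illustrated with a small sketch. This is not the authors' implementation: the `Box` type, the thresholds `high` and `min_iou`, the IoU-based affinity, and the boosting rule `iou * source.score` are all assumptions made for illustration, and a greedy one-to-one matching stands in for the bipartite graph matching named in the caption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: corner coordinates plus a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def propagate_confidence(prev_tracks: List[Box], detections: List[Box],
                         high: float = 0.6, min_iou: float = 0.5) -> List[Box]:
    """Boost low-confidence detections that overlap a confident previous track.

    Assumed rule (not from the paper): a matched candidate receives
    max(own score, IoU * source score). Matching is greedy and one-to-one,
    a simplification of an optimal bipartite assignment.
    """
    sources = [t for t in prev_tracks if t.score >= high]
    candidates = [i for i, d in enumerate(detections) if d.score < high]

    # Score every (source, candidate) pair that overlaps enough.
    pairs = []
    for s_idx, src in enumerate(sources):
        for c_idx in candidates:
            overlap = iou(src, detections[c_idx])
            if overlap >= min_iou:
                pairs.append((overlap, s_idx, c_idx))
    pairs.sort(reverse=True)  # strongest overlaps are matched first

    out = list(detections)
    used_sources, used_candidates = set(), set()
    for overlap, s_idx, c_idx in pairs:
        if s_idx in used_sources or c_idx in used_candidates:
            continue
        used_sources.add(s_idx)
        used_candidates.add(c_idx)
        boosted = max(out[c_idx].score, overlap * sources[s_idx].score)
        out[c_idx] = replace(out[c_idx], score=boosted)
    return out


# Frame t-1 holds one confident track; frame t holds two weak detections.
prev_tracks = [Box(0, 0, 10, 10, 0.9)]
detections = [Box(1, 0, 11, 10, 0.2), Box(50, 50, 60, 60, 0.2)]
recovered = propagate_confidence(prev_tracks, detections)
```

In this toy case only the first detection overlaps the previous track, so only its score is raised; the distant detection keeps its original score. An optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop when many sources and candidates compete.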