NetTrack: Tracking Highly Dynamic Objects with a Net

Guangze Zheng, Shijie Lin, Haobo Zuo, Changhong Fu, Jia Pan

TL;DR

NetTrack addresses the challenge of tracking highly dynamic open-world objects by introducing a fine-grained Net that uses points of interest for robust association and a fine-grained object-text grounding module for precise localization. The approach blends a fine-grained sampler and matching with grounding-based, open-vocabulary prompts, enabling strong generalization without fine-tuning across diverse benchmarks. The authors propose the Bird Flock Tracking (BFT) dataset to stress-test dynamicity and demonstrate state-of-the-art performance on BFT along with strong zero-shot transfer to TAO, TAO-OW, AnimalTrack, and GMOT-40. These results illustrate the potential of fine-grained learning to enhance open-world MOT, with practical implications for ecological inspection, video editing, and descriptor-guided tracking workflows.

Abstract

The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifested as severe deformations, fast motion, and occlusions. Most methods that solely depend on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues. Correspondingly, a fine-grained sampler and matching method have been incorporated. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning. Project page: https://george-zhuang.github.io/nettrack/.
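
The point-level association idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each existing track carries a handful of points of interest already propagated to the current frame by some point tracker, scores each track-box pair by the fraction of the track's points that land inside the candidate box, and uses a simple greedy matcher in place of whatever matching method the paper actually uses. The names `point_vote_scores`, `greedy_match`, and `min_score` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def point_in_box(p: Point, b: Box) -> bool:
    """Return True if point p lies inside (or on the border of) box b."""
    x, y = p
    x1, y1, x2, y2 = b
    return x1 <= x <= x2 and y1 <= y <= y2


def point_vote_scores(
    tracks: Dict[int, List[Point]], boxes: List[Box]
) -> Dict[Tuple[int, int], float]:
    """Score each (track id, box index) pair by the fraction of the
    track's propagated points that fall inside the box (assumed scoring rule)."""
    scores: Dict[Tuple[int, int], float] = {}
    for tid, pts in tracks.items():
        if not pts:
            continue  # a track without points cannot vote
        for j, b in enumerate(boxes):
            inside = sum(point_in_box(p, b) for p in pts)
            scores[(tid, j)] = inside / len(pts)
    return scores


def greedy_match(
    scores: Dict[Tuple[int, int], float], min_score: float = 0.5
) -> Dict[int, int]:
    """Greedy one-to-one matching: take pairs in descending score order,
    skipping tracks or boxes that are already matched (hypothetical matcher)."""
    matches: Dict[int, int] = {}
    used_boxes = set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (tid, j), s in ranked:
        if s < min_score:
            break  # all remaining pairs score lower
        if tid in matches or j in used_boxes:
            continue
        matches[tid] = j
        used_boxes.add(j)
    return matches


# Toy example: two tracks, two candidate boxes in the current frame.
tracks = {1: [(10, 10), (12, 14), (30, 30)], 2: [(50, 50), (52, 55)]}
boxes = [(45, 45, 60, 60), (5, 5, 20, 20)]
matches = greedy_match(point_vote_scores(tracks, boxes))
```

In this toy case, two of track 1's three points fall inside the second box and both of track 2's points fall inside the first, so each track is matched to the box that most of its points agree on, even though one point of track 1 has drifted away.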

Paper Structure (23 sections, 3 equations, 15 figures, 5 tables, 1 algorithm)

Figures (15)

  • Figure 1: (a) The visualization of the proposed NetTrack resembles a net. Object dynamicity distorts the internal relationships of the object, which challenges traditional coarse-grained tracking methods that rely solely on bounding boxes, whereas NetTrack introduces fine-grained Nets that are robust to dynamicity. (b) Qualitative results of NetTrack tracking highly dynamic objects under open-world tracking and referring expression comprehension settings. Dynamicity such as deformation and fast motion causes drastic changes in the coarse-grained representation, while the fine-grained Nets can contract robustly. The dashed boxes represent the object position at the previous time step. (c) We propose a challenging benchmark named BFT, dedicated to evaluating highly dynamic object tracking, with abundant scenarios shown in the outer ring and diverse species shown in the central word cloud.
  • Figure 2: Comparison between the OWL-ViT localization method, which is based on coarse-grained object-text correspondence, and our fine-grained method. Our fine-grained approach localizes dynamic objects better and can leverage professional descriptions from embedded descriptors (GPT-3.5 in the example) with a better understanding of context.
  • Figure 3: Dynamicity-aware association in the NetTrack framework. Unlike coarse-grained association methods that learn only box motion or overall appearance, dynamicity-aware association benefits from fine-grained Nets, which are robust to open-world dynamicity and exhibit stronger generalization ability.
  • Figure 4: (a) The geographical distribution of some representative flying bird species illustrates the diversity of BFT. The numbers on the map give the number of videos from each area, e.g., 30 videos from North America. (b) Dynamicity comparison between BFT and other datasets in terms of aspect ratio change. The more dispersed distribution indicates more frequent object deformation and occlusion in BFT. (c) Dynamicity comparison between BFT and other datasets in terms of object motion. The larger object motion in BFT indicates faster-moving objects.
  • Figure 5: Comparison of detachable modules in the proposed framework, where the SoTA grounding-based detectors GLIP and Grounding DINO (I, II) and the point trackers PIPs, TAPIR, and CoTracker (a, b, c) are considered. The robust performance across module variations confirms the generality of the proposed framework.
  • ...and 10 more figures
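
The two dynamicity measures compared in Figure 4, aspect ratio change and object motion, can be sketched for a pair of boxes of the same object in consecutive frames. The exact definitions used in the paper are not given in this summary, so the formulas below are assumptions chosen for illustration: aspect ratio change is taken as the ratio of consecutive width-to-height ratios, and motion as the center displacement normalized by the square root of the previous box area. The function names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def aspect_ratio(box: Box) -> float:
    """Width divided by height of a box; raises on degenerate boxes."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    return w / h


def aspect_ratio_change(prev: Box, curr: Box) -> float:
    """Ratio of current to previous aspect ratio (assumed definition).
    A value of 1.0 means the box shape did not change between frames."""
    return aspect_ratio(curr) / aspect_ratio(prev)


def normalized_motion(prev: Box, curr: Box) -> float:
    """Center displacement divided by sqrt(previous box area)
    (assumed definition), so small and large objects are comparable."""
    px, py = (prev[0] + prev[2]) / 2, (prev[1] + prev[3]) / 2
    cx, cy = (curr[0] + curr[2]) / 2, (curr[1] + curr[3]) / 2
    area = (prev[2] - prev[0]) * (prev[3] - prev[1])
    if area <= 0:
        raise ValueError("previous box must have positive area")
    return math.hypot(cx - px, cy - py) / math.sqrt(area)


# Toy example: a 10x10 box becomes 20x10 and its center moves by (6, 8).
prev_box = (0.0, 0.0, 10.0, 10.0)
curr_box = (1.0, 8.0, 21.0, 18.0)
ar_change = aspect_ratio_change(prev_box, curr_box)
motion = normalized_motion(prev_box, curr_box)
```

Collecting these two quantities over all consecutive box pairs in a dataset would give distributions of the kind compared in panels (b) and (c), under the stated assumptions.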