OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback

Kai Luo, Hao Shi, Kunyu Peng, Fei Teng, Sheng Wu, Kaiwei Wang, Kailun Yang

TL;DR

OmniTrack++ tackles multi-object tracking in 360° panoramic imagery by unifying the End-To-End and Tracking-By-Detection paradigms within a trajectory-feedback loop. It introduces four interdependent components: a DynamicSSM Block for distortion-robust features, FlexiTrack Instances for short-term trajectory guidance, an ExpertTrack Memory for long-term identity modeling via a Shared Mixture-of-Experts, and Tracklet Management for adaptive paradigm switching. The EmboTrack benchmark (QuadTrack and BipTrack) provides a challenging testbed for panoramic MOT in embodied robotics. Experiments on JRDB and EmboTrack show state-of-the-art performance with gains in HOTA and IDF1 over baselines, including HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. The results indicate strong robustness to egocentric motion and panoramic distortions, supporting practical panoramic perception for mobile robots and long-term tracking in dynamic environments.

Abstract

This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize poorly under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of these normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.
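
The feedback loop summarized in the abstract, and detailed in the Figure 3 caption, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Proposal`, `wrap_distance`, `e2e_branch`, `tbd_branch`, and `step` are hypothetical names, the thresholds are arbitrary, the paper's hybrid distance is replaced here by a plain center distance that wraps across the panorama seam, and the association algorithm is replaced by greedy nearest-neighbor matching.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized to [0, 1]


@dataclass
class Proposal:
    box: Box
    score: float
    track_id: Optional[int] = None  # set when decoded from a frame t-1 track instance


def wrap_distance(a: Box, b: Box) -> float:
    """Center distance whose horizontal axis wraps around the 360 degree seam.
    Assumption: a stand-in for the paper's hybrid distance."""
    dx = abs(a[0] - b[0])
    dx = min(dx, 1.0 - dx)
    dy = abs(a[1] - b[1])
    return (dx * dx + dy * dy) ** 0.5


def e2e_branch(proposals: List[Proposal], keep_thr: float) -> Dict[int, Box]:
    """E2E branch: keep track-derived proposals whose score passes a threshold."""
    return {p.track_id: p.box for p in proposals
            if p.track_id is not None and p.score >= keep_thr}


def tbd_branch(proposals: List[Proposal], tracks: Dict[int, Box],
               det_thr: float, max_dist: float):
    """TBD branch: greedily associate fresh detections with known tracks."""
    dets = [p for p in proposals if p.track_id is None and p.score >= det_thr]
    pairs = sorted((wrap_distance(d.box, box), di, tid)
                   for di, d in enumerate(dets) for tid, box in tracks.items())
    matched, used = {}, set()
    for dist, di, tid in pairs:
        if dist > max_dist:
            break  # pairs are sorted, so every remaining pair is too far apart
        if di in used or tid in matched:
            continue
        matched[tid] = dets[di].box
        used.add(di)
    unmatched = [d.box for i, d in enumerate(dets) if i not in used]
    return matched, unmatched


def step(proposals, tracks, next_id, keep_thr=0.5, det_thr=0.4, max_dist=0.1):
    """One frame: run both branches, fuse them, return the tracks fed back to t+1."""
    direct = e2e_branch(proposals, keep_thr)
    reference = {**tracks, **direct}  # latest known box per track
    matched, births = tbd_branch(proposals, reference, det_thr, max_dist)
    fused = {**matched, **direct}  # the E2E result wins when both report an ID
    for box in births:  # unmatched detections start new tracks
        fused[next_id] = box
        next_id += 1
    return fused, next_id


tracks = {1: (0.98, 0.5, 0.05, 0.2), 2: (0.30, 0.5, 0.05, 0.2)}
proposals = [
    Proposal((0.99, 0.5, 0.05, 0.2), 0.9, track_id=1),  # confident track instance
    Proposal((0.31, 0.5, 0.05, 0.2), 0.2, track_id=2),  # too weak for the E2E branch
    Proposal((0.32, 0.5, 0.05, 0.2), 0.8),  # detection that recovers track 2
    Proposal((0.01, 0.5, 0.05, 0.2), 0.7),  # duplicate of track 1 across the seam
    Proposal((0.60, 0.4, 0.05, 0.2), 0.9),  # new object
]
fused, next_id = step(proposals, tracks, next_id=3)
print(sorted(fused))  # [1, 2, 3]
```

In this toy frame, track 1 is updated directly by the E2E branch, track 2 is recovered by association after its track instance falls below the threshold, the detection just across the seam is absorbed as a duplicate of track 1, and the remaining detection opens track 3. Tracks that neither branch updates are simply dropped here, a simplification that ignores the long-term recovery the paper attributes to the ExpertTrack Memory.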

Paper Structure

This paper contains 34 sections, 14 equations, 12 figures, 10 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Comparison of mainstream tracking paradigms. (a) illustrates the typical End-To-End (E2E) paradigm; (b) shows the classical Tracking-By-Detection (TBD) paradigm; and (c) depicts our proposed OmniTrack++ paradigm, which adaptively integrates and switches between the two paradigms. In addition, OmniTrack++ employs a trajectory-feedback module that delivers rapid, large-FoV localization cues tailored to panoramic imagery, thereby narrowing the search space and stabilizing candidate selection, which ultimately improves data-association accuracy.
  • Figure 2: Overview of the EmboTrack benchmark (BipTrack and QuadTrack) and MOT results on the QuadTrack test set. (a) BipTrack subset captured by a bipedal wheel-legged platform. (b) QuadTrack subset recorded by a quadrupedal platform. Dots under each object box indicate the large-FoV trajectory of the target, depicting its motion path within the panoramic scene. Both subsets provide panoramic MOT scenarios. (c) Quantitative comparison on QuadTrack: HOTA (left axis) and IDF1 (right axis) of representative MOT methods under E2E and TBD paradigms; OmniTrack++ achieves the highest overall accuracy.
  • Figure 3: Pipeline overview of OmniTrack++. At frame $t$, the panoramic input is processed by a shared backbone, a DynamicSSM block, and an encoder to produce learnable instances for the current frame. In parallel, FlexiTrack Instances from frame $t-1$ are retrieved from the ExpertTrack Memory. These two sets of tokens are concatenated and fed into the decoder to generate object proposals. A Dual-Branch Adapter then routes them to either (i) a TBD branch, using hybrid distance calculation and an association algorithm for trajectory updates, or (ii) an E2E branch, using a thresholding strategy for direct updates. An Ensemble Module fuses both outputs to yield the final track set, which is written back to the ExpertTrack Memory to instantiate the FlexiTrack Instances for frame $t+1$, closing the feedback loop.
  • Figure 4: The proposed DynamicSSM Block is integrated into a standard DAB encoder as a plug-in enhancement. Rather than explicitly modeling panoramic geometry, it implicitly calibrates spatial and photometric feature distributions to mitigate geometric distortions and illumination variation. This adaptation yields more robust and stable representations, enabling more reliable decoding and multi-object tracking in panoramic scenes.
  • Figure 5: ExpertTrack Memory framework. The module integrates long-term Stable Identity Memory (SIM) and short-term Dynamic Interaction Memory (DIM) to jointly maintain identity consistency and adapt to rapid appearance changes under panoramic distortions. A Hierarchical Memory Controller (HMC) assigns high-confidence features to SIM and recent-frame updates to DIM. A Router then selects the top-$K_r$ features across both memories and forwards them to a Shared Mixture-of-Experts (MoE) module, where specialized experts handle diverse appearance variations—such as illumination inconsistency and geometric deformation. The aggregated expert outputs are fused into the FlexiTrack Instance, enabling robust and adaptive identity association across panoramic views.
  • ...and 7 more figures
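
The memory flow in the Figure 5 caption (controller, two memories, top-$K_r$ router, mixture of experts) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration, not the authors' design: `TrackMemory`, `mixture`, the confidence threshold, and the buffer sizes are hypothetical, the router's selection criterion is taken to be cosine similarity to a query feature, and the experts are toy functions with a softmax gate.

```python
import math
from collections import deque
from typing import Callable, List, Sequence

Vec = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class TrackMemory:
    """Per-track memory with a long-term (SIM) and a short-term (DIM) buffer."""

    def __init__(self, sim_conf: float = 0.8, sim_size: int = 8, dim_size: int = 4):
        self.sim = deque(maxlen=sim_size)  # high-confidence features only
        self.dim = deque(maxlen=dim_size)  # most recent frames, any confidence
        self.sim_conf = sim_conf

    def write(self, feat: Vec, conf: float) -> None:
        """Controller: recent updates go to DIM, confident ones also to SIM."""
        self.dim.append(feat)
        if conf >= self.sim_conf:
            self.sim.append(feat)

    def route(self, query: Vec, k: int) -> List[Vec]:
        """Router: top-k stored features by cosine similarity (assumed criterion)."""
        pool = list(self.sim) + list(self.dim)
        return sorted(pool, key=lambda f: cosine(query, f), reverse=True)[:k]


def mixture(feats: List[Vec], experts: List[Callable[[Vec], Vec]],
            gate: Callable[[Vec], List[float]]) -> Vec:
    """Gate-weighted sum of expert outputs, averaged over the routed features."""
    out = [0.0] * len(feats[0])
    for f in feats:
        weights = softmax(gate(f))
        for w, expert in zip(weights, experts):
            y = expert(f)
            for i in range(len(out)):
                out[i] += w * y[i] / len(feats)
    return out


mem = TrackMemory(sim_conf=0.8, dim_size=2)
mem.write([1.0, 0.0], conf=0.9)  # confident: stored in SIM and DIM
mem.write([0.0, 1.0], conf=0.3)  # DIM only
mem.write([0.9, 0.1], conf=0.5)  # DIM only; evicts the oldest DIM entry
selected = mem.route([1.0, 0.0], k=2)

experts = [lambda f: list(f), lambda f: [2.0 * x for x in f]]  # toy experts
uniform_gate = lambda f: [0.0, 0.0]  # equal logits, so equal expert weights
instance_feat = mixture(selected, experts, uniform_gate)
print([round(v, 3) for v in instance_feat])  # [1.425, 0.075]
```

The first feature survives in the long-term buffer after it has been evicted from the short-term one, so the router can still retrieve it. That is the behavior the caption attributes to combining SIM and DIM for identity consistency.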