ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao

TL;DR

This work proposes a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions.

Abstract

Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from a limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Comparison between RMOT and ORMOT. The wide field of view of the omnidirectional camera not only provides spatial advantages but also extends tracking duration by offering "extended temporal context", enabling ORMOT models to correctly understand long-horizon language descriptions and accurately track objects. In contrast, conventional cameras have a limited field of view, making it harder for existing RMOT models to understand long-horizon language descriptions and track objects accurately.
  • Figure 2: Overview of the multi-stage language description annotation pipeline. The pipeline has three steps: (1) Step 1 selects the semantic content for reference and identifies representative keyframes through a combination of algorithmic detection and GPT-assisted refinement. (2) Step 2 uses a large language model (GPT-4o) to generate diverse and factual descriptions based on the selected keyframes and predefined content. (3) Step 3 ensures data quality by matching each description to the corresponding person IDs in the visualized ground truth, verifying the accuracy of descriptions, enriching linguistic expressions, and aligning them with temporal segments in the video.
  • Figure 3: Visualization of the four omnidirectional-specific descriptors. (a) Boundary-crossing motion links disappearing and reappearing objects, resolving seam ambiguity. (b) Circumferential orientation cues use a 360° coordinate system for globally consistent direction. (c) Projection-aware semantic disambiguation supports understanding real motion in scenes. (d) Field-of-view transition marking distinguishes physical exits from view limitations.
  • Figure 4: Overview of the ORSet dataset statistics. (a) Word Cloud: Frequent terms primarily cover appearance (e.g., "wearing", "black"), actions (e.g., "walking"), and omnidirectional-specific cues (e.g., "edge"). The diversity of the vocabulary indicates that the annotations are rich and cover multifaceted information, while terms like "edge" reflect the unique properties of omnidirectional imagery. (b) Distribution of Language Description Types: Descriptions are mainly categorized into appearance, action, and omnidirectional-specific descriptors. Most descriptions are compositional, blending multiple attributes. (c) Distribution of Language Description Lengths: The majority of descriptions fall in the 20-80 character range, suggesting that the annotated descriptions are concise. (d) Distribution of Track Lengths: The dataset provides balanced coverage across short- and long-term tracks, supporting comprehensive temporal reasoning for long-horizon language descriptions.
  • Figure 5: Pipeline of ORTrack. It comprises three components: (1) Language-guided detection via LVLM: A Large Vision-Language Model (LVLM) serves as an open-vocabulary detector that outputs bounding boxes conditioned on the language description. (2) Two-stage cropping-based feature extraction: Hierarchical region extraction and feature encoding yield discriminative features. (3) Cross-frame association: Detected boxes are linked via cosine similarity and Hungarian matching to maintain consistent object identities.
  • ...and 5 more figures
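
The cross-frame association step named in the Figure 5 caption (cosine similarity followed by Hungarian matching) can be illustrated with a small sketch. This is not the ORTrack implementation: the function names `cosine_similarity`, `hungarian`, and `associate`, the `min_sim` threshold and its default of 0.5, and the use of plain Python lists as feature vectors are all assumptions made for illustration. The sketch builds a cost matrix of `1 - cosine similarity` between existing track features and new detection features, solves the minimum-cost assignment, and rejects pairs whose similarity is too low.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def hungarian(cost):
    """Minimum-cost assignment on a rectangular matrix.

    Returns a sorted list of (row, col) pairs covering min(rows, cols) entries.
    Classic O(n^2 * m) potentials-based formulation of the Hungarian method.
    """
    n = len(cost)
    m = len(cost[0]) if n else 0
    if n == 0 or m == 0:
        return []
    if n > m:  # the core loop needs rows <= cols, so solve the transpose
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in hungarian(transposed))
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def associate(track_feats, det_feats, min_sim=0.5):
    """Match detections to tracks.

    track_feats: dict mapping track id -> feature vector (hypothetical layout).
    det_feats:   list of feature vectors for the current frame.
    Returns (matches, unmatched): matches maps detection index -> track id,
    unmatched lists detection indices that would start new tracks.
    """
    track_ids = list(track_feats)
    if not track_ids or not det_feats:
        return {}, list(range(len(det_feats)))
    cost = [[1.0 - cosine_similarity(track_feats[t], d) for d in det_feats]
            for t in track_ids]
    matches = {}
    for r, c in hungarian(cost):
        if 1.0 - cost[r][c] >= min_sim:  # reject low-similarity pairings
            matches[c] = track_ids[r]
    unmatched = [j for j in range(len(det_feats)) if j not in matches]
    return matches, unmatched
```

In practice, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and would normally replace the hand-written `hungarian` helper; the pure-Python version is used here only to keep the sketch dependency-free. Gating on a similarity threshold after the assignment is one common design choice, and whether ORTrack applies such a gate is not stated in the caption.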