DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao

TL;DR

This work addresses the limitations of RGB-only Referring Multi-Object Tracking (RMOT) by introducing the RGBD Referring Multi-Object Tracking (DRMOT) task and the DRSet dataset, enabling 3D-aware, language-guided grounding. It presents DRTrack, a two-stage framework that combines depth-promoted language grounding via a Multimodal Large Language Model (MLLM) with a depth-enhanced OC-SORT for robust association. The RGBD association is built on a fusion score $S_{RGBD}=\alpha\cdot IoU+(1-\alpha)\cdot S_D$, which balances 2D spatial consistency ($IoU$) against depth affinity ($S_D$), and on a cost $C=-(S_{RGBD}+\lambda\cdot VDC)$, where $VDC$ is a motion prior. Empirical results on DRSet show large gains over RGB-only baselines (e.g., HOTA from $15.13\%$ to $33.24\%$), supporting the value of depth cues and MLLM-guided grounding for 3D spatial grounding and identity stability under occlusion. The work establishes a strong baseline for DRMOT, with practical implications for robotics and autonomous systems that require accurate, depth-aware multi-object grounding and tracking.
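The two formulas above can be illustrated with a minimal sketch. The source does not define $S_D$ or $VDC$ here, so the exponential depth affinity, the `sigma` scale, the default values of `alpha` and `lam`, and all function names below are assumptions chosen for illustration, not the authors' implementation. $VDC$ is treated as a precomputed scalar.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def depth_affinity(depth_a, depth_b, sigma=1.0):
    """Hypothetical S_D: equals 1 for identical depths and decays with the gap."""
    return math.exp(-abs(depth_a - depth_b) / sigma)


def s_rgbd(box_a, box_b, depth_a, depth_b, alpha=0.5):
    """S_RGBD = alpha * IoU + (1 - alpha) * S_D."""
    return alpha * iou(box_a, box_b) + (1.0 - alpha) * depth_affinity(depth_a, depth_b)


def association_cost(score, vdc, lam=0.2):
    """C = -(S_RGBD + lambda * VDC); a lower cost means a better match."""
    return -(score + lam * vdc)


if __name__ == "__main__":
    track_box, track_depth = (0, 0, 10, 10), 2.0
    # Two detections with the same 2D overlap but different (assumed) depths.
    near = s_rgbd(track_box, (1, 0, 11, 10), track_depth, 2.1)
    far = s_rgbd(track_box, (1, 0, 11, 10), track_depth, 6.0)
    print(association_cost(near, vdc=0.0), association_cost(far, vdc=0.0))
```

In this toy case both detections have the same IoU with the track, so only the depth term separates them. This is the kind of ambiguity the depth cue is meant to resolve.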

Abstract

Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data. Without explicit 3D spatial information, they struggle to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.

Paper Structure

This paper contains 26 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between RMOT and DRMOT. (a) RMOT Failure: An RMOT model that relies solely on RGB images and language (L) cannot correctly ground the referring expression under depth-dependent spatial descriptions. Although candidate objects are detected, the absence of explicit depth cues leads to ambiguous spatial reasoning and incorrect target grounding. (b) DRMOT Success: By integrating RGB, language (L), and depth (D) information, the DRMOT model leverages depth cues to resolve spatial ambiguity, thereby achieving accurate target grounding and maintaining temporal identity consistency. This comparison demonstrates the necessity of depth information for disambiguating depth-related referring descriptions that are indistinguishable in the 2D image space.
  • Figure 2: Annotation process. The annotation process includes four steps: (1) Attribute Table Creation: we build an attribute table that categorizes object descriptions into static attributes and dynamic behaviors; (2) Object Selection: we review the entire video and select representative targets based on their attributes and behaviors; (3) Language Description Annotation: we draw bounding boxes frame by frame and compose language descriptions according to the predefined attributes; and (4) Annotation Verification: we perform a two-person review to ensure valid bounding box coordinates, consistent object IDs, and accurate language descriptions. Finally, all verified annotations are packed to generate the DRSet dataset.
  • Figure 3: Overview of the DRSet dataset statistics. (a) Target Category Distribution: the DRSet dataset contains 18 diverse target categories, covering humans, vehicles, animals, and everyday objects. (b) Word Cloud: the language descriptions in DRSet draw on a rich set of keywords. (c) Object Count Distribution: the videos contain multiple targets each, and the relatively uniform distribution of object counts reflects the diversity of the dataset. (d) Number of Frames Distribution: DRSet covers videos of varying lengths, from short to long sequences, reflecting balanced temporal diversity across the dataset.
  • Figure 4: Pipeline of DRTrack. The DRTrack framework consists of two primary stages. First, the Depth-Promoted Language Grounding stage uses an MLLM to jointly process L, RGB, and D inputs. This MLLM is fine-tuned via Geometric-Aware GRPO with Format and IoU Rewards to output precise bounding boxes. Second, in the Depth-Enhanced OC-SORT Association stage, the depth maps are used to compute the RGBD Joint Similarity ($S_{\text{RGBD}}$, the final similarity equation). This similarity, combined with the VDC motion prior, defines the Association Cost (the cost matrix equation), which supports robust identity maintenance and yields the final trajectories.
  • Figure 5: Qualitative Results of DRTrack's performance on the DRSet dataset.
  • ...and 2 more figures
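
The Figure 4 caption names Format and IoU Rewards for fine-tuning the grounding MLLM but does not spell them out. The sketch below is one plausible reading, under stated assumptions: the model is assumed to emit a box as a JSON list `[x1, y1, x2, y2]`, the format reward is assumed to be binary, and the two rewards are assumed to be summed with equal weight. All function names are hypothetical, and none of this is confirmed by the source.

```python
import json
import math


def parse_box(text):
    """Return (x1, y1, x2, y2) if text is a JSON list describing a valid box, else None."""
    try:
        value = json.loads(text)
    except (ValueError, TypeError):
        return None
    if not isinstance(value, list) or len(value) != 4:
        return None
    for v in value:
        # Reject booleans, non-numbers, and non-finite values such as NaN.
        if isinstance(v, bool) or not isinstance(v, (int, float)) or not math.isfinite(v):
            return None
    x1, y1, x2, y2 = value
    if x2 <= x1 or y2 <= y1:
        return None
    return (float(x1), float(y1), float(x2), float(y2))


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def format_reward(text):
    """Assumed binary reward: 1.0 when the output parses to a valid box."""
    return 1.0 if parse_box(text) is not None else 0.0


def iou_reward(text, gt_box):
    """IoU between the predicted box and the ground truth, or 0.0 if unparseable."""
    box = parse_box(text)
    return box_iou(box, gt_box) if box is not None else 0.0


def total_reward(text, gt_box):
    """Assumed equal-weight sum of the two rewards."""
    return format_reward(text) + iou_reward(text, gt_box)
```

A malformed output earns nothing from either term, while a well-formed but poorly placed box still earns the format reward. That separation is the usual motivation for pairing a format reward with a task reward in GRPO-style training.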