DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking
Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao
TL;DR
This work addresses the limitations of RGB-only Referring Multi-Object Tracking (RMOT) by introducing the RGBD Referring Multi-Object Tracking (DRMOT) task and the DRSet dataset, enabling 3D-aware, language-guided grounding. It presents DRTrack, a two-stage framework that combines depth-promoted language grounding via a Multimodal Large Language Model (MLLM) with a depth-enhanced OC-SORT for robust association. The association is built on a fusion score $S_{RGBD}=\alpha\cdot IoU+(1-\alpha)\cdot S_D$, which balances 2D spatial consistency ($IoU$) against depth affinity ($S_D$), and a cost $C=-(S_{RGBD}+\lambda\cdot VDC)$. Empirical results on DRSet show large gains over RGB-only baselines (e.g., HOTA rising from $15.13\%$ to $33.24\%$), supporting the effectiveness of depth cues and MLLM-guided grounding for 3D spatial grounding and identity stability under occlusion. The work establishes a strong baseline for DRMOT, with practical implications for robotics and autonomous systems that require accurate, depth-aware multi-object grounding and tracking.
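To make the two formulas concrete, the following is a minimal Python sketch of the fusion score and the association cost, not the authors' implementation. The formulas for $S_{RGBD}$ and $C$ are taken from the summary above; everything else is an assumption made for illustration. In particular, the summary does not define how $S_D$ or $VDC$ are computed, so `depth_affinity` is a hypothetical stand-in based on relative depth difference, `VDC` is passed in as a precomputed number, and the values chosen for `alpha` and `lam` are arbitrary.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def depth_affinity(depth_a, depth_b):
    """Hypothetical S_D: 1.0 for equal depths, decreasing as they diverge.

    Assumption for this sketch only; the source does not define S_D.
    Non-positive depths are treated as missing and give 0.0.
    """
    if depth_a <= 0 or depth_b <= 0:
        return 0.0
    return 1.0 - abs(depth_a - depth_b) / max(depth_a, depth_b)


def rgbd_score(iou_value, s_d, alpha):
    """S_RGBD = alpha * IoU + (1 - alpha) * S_D."""
    return alpha * iou_value + (1.0 - alpha) * s_d


def association_cost(s_rgbd, vdc, lam):
    """C = -(S_RGBD + lambda * VDC); a lower cost means a better match."""
    return -(s_rgbd + lam * vdc)


# Example: one track/detection pair with illustrative values.
track_box, det_box = (0, 0, 2, 2), (1, 0, 3, 2)
track_depth, det_depth = 4.0, 5.0  # hypothetical depths, e.g. in metres
alpha, lam = 0.5, 0.5              # arbitrary weights, not from the paper
vdc = 0.2                          # precomputed VDC term, assumed given

score = rgbd_score(iou(track_box, det_box),
                   depth_affinity(track_depth, det_depth),
                   alpha)
cost = association_cost(score, vdc, lam)
```

Because $C$ is the negated sum, a pair with high overlap, similar depth, and a high $VDC$ term receives the lowest cost. With $\alpha=1$ the score reduces to plain IoU, so $\alpha$ directly controls how much depth influences the match.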
Abstract
Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data. Lacking explicit 3D spatial information, they struggle to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.
