Cognitive Disentanglement for Referring Multi-Object Tracking

Shaofeng Liang, Runwei Guan, Wangwang Lian, Daizong Liu, Xiaolou Sun, Dongming Wu, Yutao Yue, Weiping Ding, Hui Xiong

TL;DR

Referring Multi-Object Tracking (RMOT) requires accurate localization and tracking of language-specified objects in video. This work introduces Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT), which is inspired by the ventral ('what') and dorsal ('where') streams of the human visual system and separates static object attributes from spatial-motion information. It combines three components: Bidirectional Interactive Fusion, Progressive Semantic-Decoupled Query Learning, and a Structural Consistency Constraint. The approach yields state-of-the-art results on Refer-KITTI and Refer-KITTI-V2, with average HOTA gains of 6.0% and 3.2%, respectively, and improvements in related metrics while maintaining practical efficiency. These findings demonstrate the potential of cognitive-inspired, multi-source information fusion to improve language-guided tracking in complex scenes and offer insights for broader multimodal perception tasks.

Abstract

As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes that require a comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose the Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework to address these challenges. It adapts the "what" and "where" pathways of the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state of the art in RMOT while also providing new insights into multi-source information fusion.

Paper Structure

This paper contains 36 sections, 23 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: From existing RMOT methods to our CDRMT framework, inspired by the human visual system. (A) Conventional RMOT approaches typically employ a unified architecture in which visual and textual features are simply fused before tracking. (B) The human visual system processes information through two distinct pathways: the ventral ("what") stream for object recognition and the dorsal ("where") stream for spatial processing, which serves as the biological inspiration for our method. (C) Our proposed Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework explicitly decouples and separately processes natural language description information, while introducing structural consistency constraints to enhance bidirectional semantic understanding between vision and language.
  • Figure 2: The architecture of our proposed Cognitive Disentanglement for Referring Multi-Object Tracking framework. It processes sequential video frames through three collaborative components: (A) The Bidirectional Interactive Fusion module first establishes cross-modal connections while preserving modality-specific characteristics. (B) The Progressive Semantic-Decoupled Query Learning (PSDQL) module, inspired by the dual-stream ("what"/"where") processing mechanism in the human visual system, separates language into static object attributes ("what" pathway) and spatial motion information ("where" pathway) to guide object queries. (C) The Structural Consistency Constraint (SCC) mechanism is applied only during the training stage and enforces geometric consistency between original text embeddings and their reconstructed counterparts to enhance semantic alignment between visual objects and linguistic descriptions.
  • Figure 3: Overview of the Bidirectional Interactive Fusion module. Here, "Bidirectional" refers specifically to the cross-modal interaction pattern between visual and language modalities, rather than a forward-backward process.
  • Figure 4: Overview of the Query Semantic Injection (QSI) module architecture. It efficiently incorporates disentangled semantic features ($\hat{\mathbf{f}}_{so}$ for static object attributes or $\hat{\mathbf{f}}_{sm}$ for spatial motion information) into query representations $Q_t$. The Attention Active Module (AAM) dynamically enhances feature saliency through multi-perspective feature aggregation, generating adaptive attention weights that modulate the cross-attention process. This enables queries to selectively assimilate significant information, facilitating fine-grained semantic-aware object localization.
  • Figure 5: (A) Overview of our Referring Multi-Object Tracking model with structural consistency constraints. (B) Point-wise consistency constraint enforces direct correspondence between entity embeddings ($e_i$ and $e'_i$) through distance-based matching ($\mathcal{L}_{dis}$). (C) Our structural consistency constraint preserves geometric relationships between entities in both embedding spaces ($\mathcal{E}$ and $\mathcal{E}'$) using both distance and angle consistency losses ($\mathcal{L}_{dis}$ & $\mathcal{L}_{angle}$), ensuring that semantic relationships between objects remain consistent even when specific descriptions vary.
  • ...and 7 more figures
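
The Figure 5 caption contrasts a point-wise constraint with a structural one that preserves pairwise distances and angles between entity embeddings in $\mathcal{E}$ and the reconstructed space $\mathcal{E}'$. The exact definitions of $\mathcal{L}_{dis}$ and $\mathcal{L}_{angle}$ are not given in this excerpt, so the following is only a minimal NumPy sketch of one common way to express such a constraint. The function names, the mean-distance normalization, the squared-error penalty, and the weights `w_dis` and `w_angle` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def normalized_distances(E, eps=1e-8):
    """Pairwise Euclidean distances between rows of E, divided by their mean.

    Normalizing by the mean off-diagonal distance makes the result
    insensitive to a uniform rescaling of the embedding space (assumption).
    """
    diff = E[:, None, :] - E[None, :, :]           # diff[i, j] = e_i - e_j
    D = np.linalg.norm(diff, axis=-1)              # shape (n, n)
    off_diag = ~np.eye(E.shape[0], dtype=bool)
    return D / (D[off_diag].mean() + eps)


def triplet_cosines(E, eps=1e-8):
    """Cosine of the angle at vertex j formed by entities i and k.

    Returns an array of shape (n, n, n) indexed as [i, j, k].
    """
    diff = E[:, None, :] - E[None, :, :]           # diff[i, j] = e_i - e_j
    unit = diff / (np.linalg.norm(diff, axis=-1, keepdims=True) + eps)
    return np.einsum("ijd,kjd->ijk", unit, unit)


def structural_consistency_loss(E, E_rec, w_dis=1.0, w_angle=1.0):
    """Hypothetical structural loss between original and reconstructed embeddings.

    E and E_rec have shape (n_entities, dim); the two dims may differ.
    Returns (total, l_dis, l_angle).
    """
    l_dis = np.mean((normalized_distances(E) - normalized_distances(E_rec)) ** 2)
    l_angle = np.mean((triplet_cosines(E) - triplet_cosines(E_rec)) ** 2)
    return w_dis * l_dis + w_angle * l_angle, l_dis, l_angle


rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                        # four entity embeddings

# Same geometry up to scale and shift: the structural loss stays near zero,
# whereas a point-wise distance between e_i and e'_i would be large.
same_structure, _, _ = structural_consistency_loss(E, 2.0 * E + 3.0)

# Unrelated embeddings: the relational structure differs, so the loss is positive.
different_structure, _, _ = structural_consistency_loss(E, rng.normal(size=(4, 8)))
```

In this sketch the loss compares relations between entities rather than the embeddings themselves, which is the property the caption attributes to the structural constraint: the reconstructed embeddings may differ from the originals as long as their relative geometry is preserved.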