Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again

Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su

TL;DR

This paper proposes FlexHook, a two-stage Referring-by-Tracking (RBT) framework for Referring Multi-Object Tracking (RMOT). It targets two limitations of existing two-stage pipelines, heuristic feature construction and CLIP-based correspondence, and replaces them with a sampling-based Conditioning Hook (C-Hook) and a learnable Pairwise Correspondence Decoder (PCD). This design preserves backbone gradient flow, injects language-conditioned cues, and enables active pairwise discrimination across modalities, with reported accuracy and efficiency gains on Refer-KITTI/v2, Refer-Dance, and LaMOT. The results indicate that two-stage RBT can surpass many one-stage approaches while remaining modular and flexible to deploy, thanks to CLIP-free matching and open integration with existing detectors and trackers. Overall, FlexHook strengthens the two-stage RBT paradigm and offers practical benefits for open-world, incremental RMOT applications.

Abstract

Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code can be found in the Supplementary Materials.
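
To make concrete what "CLIP-based similarity matching" refers to, here is a minimal sketch of cosine-similarity matching between per-track embeddings and a text embedding. It illustrates the baseline strategy that the abstract says PCD replaces, not FlexHook itself. The toy vectors stand in for encoder outputs, and the names `cosine_similarity`, `match_tracks`, and the `threshold` value are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_tracks(track_embeddings, text_embedding, threshold=0.5):
    """Return the ids of tracks whose similarity to the text exceeds threshold.

    track_embeddings: dict mapping track id -> embedding (hypothetical layout).
    threshold: illustrative cut-off, not a value taken from the paper.
    """
    return [
        track_id
        for track_id, embedding in track_embeddings.items()
        if cosine_similarity(embedding, text_embedding) > threshold
    ]
```

In this scheme the score is a fixed function of two independently encoded vectors, so nothing in the matching step is trained. The abstract contrasts this with a decoder that models the correspondence directly.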

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (a) Tracking-by-Referring associates trajectories within boxes located by GroundingDINO [liu2023grounding]. (b) One-stage Referring-by-Tracking projects queries decoded by MOTR [10.1007/978-3-031-19812-0_38] to matching scores based on expressions.
  • Figure 2: (a) Previous two-stage RBT reuses the encoder and computes CLIP-based [pmlr-v139-radford21a] cosine similarity as the matching score. (b) Our method directly hooks features from the visual backbone and decodes scores via PCD in a composed feature space.
  • Figure 3: The overall framework of FlexHook. FlexHook directly extracts features from multi-scale feature maps via C-Hook during the original workflow, without any additional encoding stages. C-Hook consists of two components: Neighboring Grid Sampling, which samples target features $F_J$ based on trajectory bounding boxes $\mathcal{B}^i_{t:t+p}$, and Conditioning Enhancement, which samples reference features $F_r$ conditioned on linguistic features $F_l$. The sampled features are fused across frames $p$ through Temporal Integration. The multi-scale features are finally aggregated using a feature pyramid [Lin_2017_CVPR] and decoded layer-by-layer in PCD to generate the final matching scores.
  • Figure 4: Illustration of Conditioning Enhancement. Guided by the linguistic feature $F_l$, we compute $M$ reference points $P_r$ through a Transformer decoder and a residual MLP followed by a sigmoid function.
  • Figure 5: Illustration of the Temporal Integration. We concatenate multi-frame features with grid displacements, then compress them along the channel dimension.
  • ...and 1 more figure
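
The Figure 4 caption describes reference points $P_r$ produced by a sigmoid and then used to sample reference features $F_r$. The sketch below shows one plausible reading of that last step: squash raw 2-D outputs into normalized coordinates in [0, 1], then read a feature map at those locations with bilinear interpolation. The Transformer decoder and residual MLP are not modeled here; `point_logits` stands in for their output. The single-channel map, the bilinear scheme, and all function names are assumptions, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_sample(fmap, x, y):
    """Sample a single-channel H x W map at normalized (x, y) in [0, 1]."""
    h, w = len(fmap), len(fmap[0])
    # Map normalized coordinates onto the index ranges 0..w-1 and 0..h-1.
    px, py = x * (w - 1), y * (h - 1)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    # Clamp the far neighbours so points on the border stay in bounds.
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


def sample_reference_features(fmap, point_logits):
    """Turn M raw (x, y) outputs into reference points and sample the map.

    point_logits: list of M (x, y) pairs, a placeholder for the output of
    the decoder and residual MLP described in the caption.
    """
    points = [(sigmoid(lx), sigmoid(ly)) for lx, ly in point_logits]
    return [bilinear_sample(fmap, x, y) for x, y in points]
```

For example, on the 2 x 2 map `[[0.0, 1.0], [2.0, 3.0]]`, the logits `(0.0, 0.0)` map to the centre point (0.5, 0.5), and the sampled value is the average of the four cells. Because the sigmoid keeps every point inside the map, no separate bounds check is needed before sampling.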