QTrack: Query-Driven Reasoning for Multi-modal MOT

Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak, Moloud Abdar, Janibul Bashir

Abstract

Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
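
The sequence-level split mentioned above can be illustrated with a minimal sketch. It is not the RMOT26 tooling: the `sequence_id`, `query`, and `track_id` field names, the `sequence_level_split` helper, and the 20% test fraction are all hypothetical choices for illustration. The sketch assumes that object identities are scoped to a single sequence, so keeping each sequence wholly in one split also keeps its identities out of the other.

```python
import random
from collections import defaultdict


def sequence_level_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that each sequence lands wholly in train or test.

    `samples` is a list of dicts carrying a hypothetical "sequence_id" key.
    """
    # Group every sample under the sequence it was drawn from.
    by_sequence = defaultdict(list)
    for sample in samples:
        by_sequence[sample["sequence_id"]].append(sample)

    # Shuffle sequence ids (not individual samples) with a fixed seed.
    sequence_ids = sorted(by_sequence)
    random.Random(seed).shuffle(sequence_ids)

    # Reserve at least one whole sequence for the test split.
    n_test = max(1, round(len(sequence_ids) * test_fraction))
    test_ids = set(sequence_ids[:n_test])

    train = [s for sid in sequence_ids if sid not in test_ids
             for s in by_sequence[sid]]
    test = [s for sid in sequence_ids if sid in test_ids
            for s in by_sequence[sid]]
    return train, test


# Toy data: 5 sequences with 3 query/track samples each (made-up values).
samples = [
    {"sequence_id": f"seq{i // 3}", "query": f"query {i}", "track_id": i % 3}
    for i in range(15)
]
train, test = sequence_level_split(samples, test_fraction=0.2, seed=0)
train_seqs = {s["sequence_id"] for s in train}
test_seqs = {s["sequence_id"] for s in test}
```

Splitting on sequence ids rather than on individual samples is the point of the design: a per-sample random split could place two queries about the same tracked object on both sides of the split, which is the kind of leakage a sequence-level split avoids.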

Paper Structure (56 sections, 40 equations, 11 figures, 8 tables, 2 algorithms)

Figures (11)

  • Figure 1: Comparison of tracking paradigms. (a) Traditional MOT follows a tracking-by-detection paradigm, tracking all objects from predefined categories regardless of user intent. (b) QTrack enables reasoning-aware, query-conditioned tracking: given a video and natural language query, it selectively identifies and tracks only the specified targets, shifting from all-object tracking to semantic, user-driven tracking.
  • Figure 2: Overview of the QTrack framework architecture. Given an input video sequence $\mathcal{V}=\{I_t\}_{t=1}^{T}$ and a natural-language query $q$, QTrack processes the data through a unified vision-language model (VLM). The model first generates a chain-of-thought reasoning trace, analyzing the query in the context of the visual scene to identify targets based on attributes, relationships, or motion. It then directly predicts bounding-box trajectories $\{\tau_i\}_{i=1}^{N_q}$ for the queried targets across all frames, performing joint spatial grounding and temporal association.
  • Figure 3: RMOT26 dataset construction pipeline. The pipeline shows how benchmark instances are created from existing MOT datasets.
  • Figure 4: Ablation analysis of QTrack components. TAPO significantly improves motion consistency and localization stability over GRPO.
  • Figure 5: VisionReasoner fails to detect and track all objects across the frames, whereas QTrack detects and tracks every object in all frames.
  • ...and 6 more figures