Tracking with Human-Intent Reasoning

Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, Xuansong Xie

TL;DR

This paper introduces instruction tracking, a new task in which implicit human instructions govern video object tracking, addressing the practical burden of manually specifying targets. It presents TrackGPT, a tracker that uses a Large Vision-Language Model (LVLM) as a reasoning brain to interpret instructions and generate referring embeddings, complemented by a cross-frame referring propagation mechanism and a rethinking module that keep predictions aligned with the intent of the instruction. An InsTrack benchmark with over 1k instruction-video pairs is proposed for tuning and evaluation. TrackGPT achieves competitive results on standard referring tracking benchmarks while attaining state-of-the-art performance on instruction tracking, notably $66.5$ $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS$_{17}$ and $54.9$ $\mathcal{J}\&\mathcal{F}$ on InsTrack. The work demonstrates the potential of integrating LVLM-based reasoning into online tracking, enabling more intelligent, instruction-grounded perception with practical applications in interactive perception systems.

Abstract

Advances in perception modeling have significantly improved the performance of object tracking. However, current methods specify the target object in the initial frame either by 1) a box or mask template or 2) an explicit language description. These approaches are cumbersome and do not give the tracker any self-reasoning ability. Therefore, this work proposes a new tracking task, Instruction Tracking, in which implicit tracking instructions are provided and the tracker must perform tracking automatically in video frames. To achieve this, we investigate the integration of knowledge and reasoning capabilities from a Large Vision-Language Model (LVLM) for object tracking. Specifically, we propose a tracker called TrackGPT, which is capable of performing complex reasoning-based tracking. TrackGPT first uses the LVLM to understand tracking instructions and condense the cues of what target to track into referring embeddings. The perception component then generates the tracking results based on the embeddings. To evaluate the performance of TrackGPT, we construct an instruction tracking benchmark called InsTrack, which contains over one thousand instruction-video pairs for instruction tuning and evaluation. Experiments show that TrackGPT achieves competitive performance on referring video object segmentation benchmarks, such as a new state-of-the-art performance of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also demonstrates superior instruction tracking performance under new evaluation protocols. The code and models are available at https://github.com/jiawen-zhu/TrackGPT.

Paper Structure (18 sections, 4 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Comparison of different tracking paradigms. (a) Tracking by box/mask template. (b) Tracking by explicit linguistic description. (c) Tracking by human instruction. (d) Differences in existing tracking tasks. SOT: single object tracking. VLT: vision-language tracking. VOS: video object segmentation. RVOS: referring video object segmentation. InsT: instruction tracking.
  • Figure 2: Overall architecture of TrackGPT. The current frame $\bm{I}_{t}$ is fed into a visual encoder to extract visual features. The initial frame $\bm{I}_0$ and the corresponding tracking instruction are sent into an LVLM brain to comprehend human intent and generate the referring queries $\bm{Q}_R$ for the target object. Finally, the decoder receives the visual features and linguistic embeddings and predicts the tracking results $\bm{m}_t$. The red arrows in the pipeline indicate the proposed rethinking mechanism, and $\bm{Q}_t\rightarrow\bm{Q}_{t+1}$ represents the cross-frame referring propagation.
  • Figure 3: The proposed cross-frame referring propagation. An initial referring query $\bm{Q}_R$ and a cross-frame online referring query $\bm{Q}_t$ are responsible for decoding the target object mask.
  • Figure 4: Quantitative results from the InsTrack test set. TrackGPT comprehends the human instruction and accurately tracks the target object.
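
The data flow described in the captions of Figures 2 and 3 can be illustrated with a toy online loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: the visual encoder and the LVLM are replaced by precomputed per-pixel feature arrays and a hand-supplied referring query $\bm{Q}_R$, the decoder is a plain dot-product threshold, the update rule for $\bm{Q}_t\rightarrow\bm{Q}_{t+1}$ is an invented moving average, and the rethinking mechanism is omitted. All function names (`decode_mask`, `propagate_query`, `track`, `make_frame`) are hypothetical.

```python
import numpy as np


def decode_mask(features, q_ref, q_online, threshold=0.5):
    """Toy decoder: score every pixel against both queries, then threshold.

    features: (H, W, C) per-pixel features (stand-in for the visual encoder).
    q_ref:    (C,) initial referring query Q_R (stand-in for the LVLM output).
    q_online: (C,) cross-frame online referring query Q_t.
    """
    # Assumption: the two queries are combined by a simple average.
    query = 0.5 * (q_ref + q_online)
    scores = features @ query  # (H, W)
    return scores > threshold


def propagate_query(features, mask, q_prev, momentum=0.5):
    """Toy update Q_t -> Q_{t+1}: blend Q_t with the mean feature under the mask."""
    if not mask.any():
        # Nothing was predicted, so keep the previous online query unchanged.
        return q_prev
    target_feature = features[mask].mean(axis=0)  # (C,)
    return momentum * q_prev + (1.0 - momentum) * target_feature


def track(frames, q_ref):
    """Run the online loop: decode a mask per frame, then update the online query."""
    q_online = q_ref.copy()
    masks = []
    for features in frames:
        mask = decode_mask(features, q_ref, q_online)
        masks.append(mask)
        q_online = propagate_query(features, mask, q_online)
    return masks


def make_frame(target_rc, size=4):
    """Hypothetical test frame: background feature [0, 1], one target pixel [1, 0]."""
    features = np.zeros((size, size, 2))
    features[..., 1] = 1.0
    features[target_rc] = [1.0, 0.0]
    return features


if __name__ == "__main__":
    frames = [make_frame((0, 0)), make_frame((1, 1)), make_frame((2, 3))]
    q_ref = np.array([1.0, 0.0])  # pretend the LVLM produced this query
    for t, mask in enumerate(track(frames, q_ref)):
        print(t, np.argwhere(mask).tolist())
```

Keeping $\bm{Q}_R$ fixed while only $\bm{Q}_t$ is updated mirrors the caption's statement that both queries take part in decoding the mask; how the two are actually fused and updated in TrackGPT is not specified in this section.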