Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

Yuqing Huang, Guotian Zeng, Zhenqiao Yuan, Zhenyu He, Xin Li, Yaowei Wang, Ming-Hsuan Yang

Abstract

Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.

Figures (13)

  • Figure 1: Interactive tracking in a basketball sequence. User prompts guide the tracker through state changes, target changes, and global retargeting, demonstrating the interaction loop required for real-world analysis.
  • Figure 2: Overview of the interactive evaluation protocol. Each video is divided into segments by user prompts; dashed lines denote ground-truth trajectories, and colored curves represent tracker predictions. At each prompt, the tracker must either update its prediction or switch targets based on the user input. $\checkmark$ and $\times$ denote correct and incorrect responses, respectively. A code sketch of this loop follows the figure list.
  • Figure 3: Overview of the proposed Interactive Memory-Augmented Tracking (IMAT) framework. The user provides natural-language descriptions of the target to the interactive perception module, which performs reasoning-guided grounding. The cognitive arbitration module then compares the grounded result with the tracker's prediction and either corrects the trajectory and updates the memory banks or confirms the current tracking state. A simplified arbitration sketch follows the figure list.
  • Figure 4: Success-rate results on the six scenarios of the InteractTrack benchmark, showing that the proposed baseline performs strongly across all six.
  • Figure 5: Representative sequences from the daily activities scenario. This scenario includes everyday indoor and outdoor activities captured from a third-person viewpoint. The sequences depict interactions such as playing with pets, handling household objects, and casual human motion under moderate viewpoint shifts. The highlighted sequence illustrates target-switch interaction cues.
  • ...and 8 more figures
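
To make the evaluation protocol of Figure 2 concrete, the sketch below runs one video through a prompt-segmented loop: prompts are delivered to the tracker at their timestamps, and every frame is scored against the (possibly switched) ground truth. This is a minimal illustration, not the benchmark's reference implementation; `Prompt`, `receive_instruction`, `track`, and the 0.5 success threshold are hypothetical names and values.

```python
# Hedged sketch of the prompt-segmented evaluation loop (Figure 2).
# All interfaces here are illustrative, not the benchmark's actual API.
from dataclasses import dataclass

SUCCESS_THRESHOLD = 0.5  # assumed IoU cutoff for a per-frame success

@dataclass
class Prompt:
    frame_idx: int  # timestamp at which the user intervenes
    text: str       # natural-language instruction (e.g., a target switch)

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate_sequence(tracker, frames, gt_boxes, prompts):
    """Score one video: deliver each prompt at its timestamp, then
    compare every prediction with the (possibly switched) ground truth."""
    pending = sorted(prompts, key=lambda p: p.frame_idx)
    hits = 0
    for t, frame in enumerate(frames):
        # Deliver any user instruction due at this frame before tracking.
        while pending and pending[0].frame_idx == t:
            tracker.receive_instruction(pending.pop(0).text)
        pred = tracker.track(frame)
        hits += iou(pred, gt_boxes[t]) >= SUCCESS_THRESHOLD
    return hits / len(frames)  # per-video success rate
```

Dividing the video at prompt timestamps, as the caption describes, also permits per-segment scores that isolate how well a tracker responds to each individual instruction.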
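The cognitive arbitration step described in the Figure 3 caption can likewise be pictured as a small decision rule: when a user instruction yields a grounded box that disagrees with the tracker's prediction, the grounded box wins and the memory is refreshed; otherwise the current state is confirmed. The snippet below (reusing `iou` from the previous sketch) is an assumption-laden simplification; `MemoryBank`, `arbitrate`, and the agreement threshold are invented here, and the actual IMAT module performs reasoning-guided grounding rather than a bare IoU test.

```python
# Hedged sketch of IMAT-style cognitive arbitration (Figure 3).
# Names and thresholds are assumptions, not the paper's implementation.
AGREEMENT_IOU = 0.5  # assumed overlap above which grounding confirms the tracker

class MemoryBank:
    """Toy FIFO memory of recent target appearance features."""
    def __init__(self, capacity=8):
        self.slots, self.capacity = [], capacity

    def update(self, feature):
        self.slots.append(feature)
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # evict the oldest appearance

def arbitrate(tracker_box, grounded_box, frame_feature, memory):
    """Compare the tracker's prediction with the language-grounded box;
    on disagreement, correct the trajectory and refresh the memory."""
    if grounded_box is None:  # no active user instruction this frame
        return tracker_box
    if iou(tracker_box, grounded_box) >= AGREEMENT_IOU:
        return tracker_box    # predictions agree: confirm tracking state
    memory.update(frame_feature)  # disagreement: re-anchor on the grounding
    return grounded_box
```

Deferring to the grounded result on disagreement encodes the paradigm's core premise that, at interaction time, the user's instruction is more trustworthy than the tracker's internal state.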