TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang

TL;DR

TrackVLA++ extends earlier Vision-Language-Action frameworks for embodied visual tracking with two additions: Polar-CoT spatial reasoning and a Target Identification Memory (TIM). Together they enable robust, long-horizon tracking under severe occlusions and distractors in both egocentric and multi-camera settings, achieving state-of-the-art results on EVT-Bench and Gym-UnrealCV, with strong zero-shot real-world generalization. The approach stays efficient by encoding spatial relations as compact tokens and gating memory updates by confidence, which improves accuracy without prohibitive computational cost. This makes the method practical for real-world EVT in dynamic environments and multi-camera setups.

Abstract

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
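
The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the bin counts, the distance range, the extra "not visible" token, the hard confidence threshold, and the moving-average update rule are all assumptions chosen for illustration, and the function names `polar_token` and `gated_update` are hypothetical. The sketch only shows the general idea of quantizing a relative position into one polar-coordinate token and of skipping memory updates when the prediction is not confident.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
NUM_ANGLE_BINS = 60      # angular resolution of the polar grid
NUM_DIST_BINS = 30       # radial resolution of the polar grid
MAX_DIST_M = 6.0         # distances beyond this are clipped to the last bin
INVALID_TOKEN = NUM_ANGLE_BINS * NUM_DIST_BINS  # extra id: target not visible


def polar_token(x_forward, y_left):
    """Quantize a target position in the robot frame into one token id."""
    angle = math.atan2(y_left, x_forward)              # in [-pi, pi]
    dist = min(math.hypot(x_forward, y_left), MAX_DIST_M - 1e-9)
    a_bin = int((angle + math.pi) / (2 * math.pi) * NUM_ANGLE_BINS) % NUM_ANGLE_BINS
    d_bin = int(dist / MAX_DIST_M * NUM_DIST_BINS)
    return a_bin * NUM_DIST_BINS + d_bin


def gated_update(memory, observation, confidence, threshold=0.5, momentum=0.9):
    """Update the target memory only when the prediction is confident.

    Hypothetical rule: a hard threshold followed by an exponential moving
    average. Low-confidence frames (e.g. during occlusion) leave the
    memory unchanged so that it keeps describing the original target.
    """
    if confidence < threshold:
        return list(memory)
    return [momentum * m + (1.0 - momentum) * o
            for m, o in zip(memory, observation)]


if __name__ == "__main__":
    token = polar_token(1.1, 0.2)          # target about 1.1 m ahead, slightly left
    memory = [1.0, 0.0]
    memory = gated_update(memory, [0.0, 1.0], confidence=0.2)  # occluded: skipped
    memory = gated_update(memory, [0.0, 1.0], confidence=0.9)  # visible: blended
    print(token, memory)
```

A single token per frame keeps the reasoning output short, which is consistent with the efficiency claim in the TL;DR; the gate is what prevents a distractor seen during an occlusion from overwriting the stored target features.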

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Real-world demonstration of TrackVLA++. TrackVLA++ is a novel Vision-Language-Action model that incorporates spatial reasoning and target identification memory, enabling superior performance in both long-horizon and highly crowded tracking scenarios.
  • Figure 2: The pipeline of TrackVLA++. Given a video stream and a language instruction, TrackVLA++ predicts a tracking trajectory by utilizing Polar-CoT reasoning to infer the target's position and continuously updating the Target Identification Memory with CoT-based predictions for long-horizon tracking.
  • Figure 3: Real-world system architecture.
  • Figure 4: Visualizations of the Simulation Experiments. TrackVLA++ performs well under occlusion and interference conditions. The upper-left inset displays the Polar-CoT prediction, with the red area indicating the predicted target position, and the visualization on EVT-Bench is cropped to a front sector for conciseness. Zoom in for a better view.
  • Figure 5: Visualizations of the Real World Experiments. We evaluate TrackVLA++ on three different tasks: Obstacle, Winding Path, and Distractor, showcasing the tracking performance during target disappearance and occlusion. The bar chart provides a quantitative comparison of success rate between TrackVLA and TrackVLA++, highlighting the improved performance of our method.