TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Jiahang Liu; Yunpeng Qi; Jiazhao Zhang; Minghan Li; Shaoan Wang; Kui Wu; Hanjing Ye; Hong Zhang; Zhibo Chen; Fangwei Zhong; Zhizheng Zhang; He Wang

TL;DR

TrackVLA++ advances embodied visual tracking (EVT) by adding two modules to prior Vision-Language-Action frameworks: Polar-CoT spatial reasoning and a Target Identification Memory (TIM). Together they enable robust, long-horizon tracking under severe occlusions and distractors in both egocentric and multi-camera settings, achieving state-of-the-art results on EVT-Bench and Gym-UnrealCV with strong zero-shot real-world generalization. The approach stays efficient by encoding spatial relations as compact tokens and gating memory updates by confidence, which improves accuracy without prohibitive computational cost. These properties make the method practical for real-world EVT in dynamic environments and multi-camera setups.

Abstract

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
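
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes the target's relative position is available as planar coordinates in the robot frame, quantises it into a single polar token id, and applies a confidence-gated blend to a memory vector. The bin counts, distance range, threshold, and the function names `polar_token` and `gated_update` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical discretisation settings (not taken from the paper).
NUM_ANGLE_BINS = 60
NUM_DIST_BINS = 30
MAX_DIST_M = 6.0


def polar_token(x_forward: float, y_left: float) -> int:
    """Quantise a relative (x, y) position into one polar-coordinate token id."""
    angle = math.atan2(y_left, x_forward)  # bearing in [-pi, pi]
    dist = min(math.hypot(x_forward, y_left), MAX_DIST_M)  # clamp far targets
    a_bin = min(int((angle + math.pi) / (2 * math.pi) * NUM_ANGLE_BINS),
                NUM_ANGLE_BINS - 1)
    d_bin = min(int(dist * NUM_DIST_BINS / MAX_DIST_M), NUM_DIST_BINS - 1)
    return a_bin * NUM_DIST_BINS + d_bin


def gated_update(memory, observation, confidence, threshold=0.5):
    """Blend new target features into memory only when confidence is high.

    Below the (assumed) threshold the memory is returned unchanged, so an
    occluded or ambiguous frame cannot overwrite the stored target identity.
    """
    if confidence < threshold:
        return list(memory)
    gate = confidence
    return [(1 - gate) * m + gate * o for m, o in zip(memory, observation)]


if __name__ == "__main__":
    token = polar_token(1.1, 0.0)  # target 1.1 m straight ahead
    memory = gated_update([1.0, 0.0], [0.0, 1.0], confidence=0.2)
    print(token, memory)
```

In this sketch a low-confidence frame leaves the memory untouched, which mirrors the abstract's description of preserving long-horizon target memory during occlusions; how the actual model computes confidence and represents memory is not specified in this section.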
