TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang

TL;DR

TrackVLA is a unified Vision-Language-Action model for embodied visual tracking (EVT) that jointly learns target recognition and trajectory planning on a shared LLM backbone. It is trained on an EVT-focused dataset (EVT-Bench) plus open-world VQA data, 1.7 million samples in total, and runs at 10 FPS with strong zero-shot and sim-to-real generalization. The model uses EVA-CLIP-based observation encoding, parallel recognition and tracking branches, and an anchor-based diffusion head that generates waypoints. It reports state-of-the-art results on Gym-UnrealCV and robust performance in complex EVT scenarios, with recognition and tracking trained together under natural language guidance.

Abstract

Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples of varying difficulty, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates state-of-the-art performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.
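
The shared-backbone, two-head design described above can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names (SharedBackbone, recognition_head, tracking_head, run) and all of the arithmetic are hypothetical placeholders, chosen only to show how one encoded representation might be routed either to a text answer or to a list of waypoints.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Waypoint = Tuple[float, float]


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for the shared LLM backbone."""

    def encode(self, frames: List[List[float]], instruction: str) -> List[float]:
        # Toy features: the mean of each frame plus the instruction length.
        # A real model would produce visual and language token embeddings.
        frame_means = [sum(frame) / len(frame) for frame in frames]
        return frame_means + [float(len(instruction))]


def recognition_head(features: List[float]) -> str:
    # Placeholder for the language modeling head: returns a text answer.
    return "yes" if features[0] > 0.5 else "no"


def tracking_head(features: List[float], num_waypoints: int = 4) -> List[Waypoint]:
    # Placeholder for the trajectory head: returns evenly spaced waypoints.
    # The paper uses an anchor-based diffusion model here; this is not one.
    step = features[0]
    return [(step * (i + 1), 0.0) for i in range(num_waypoints)]


def run(
    backbone: SharedBackbone,
    frames: List[List[float]],
    instruction: str,
    task: str,
) -> Union[str, List[Waypoint]]:
    # Both tasks share one encoding; only the output head differs.
    features = backbone.encode(frames, instruction)
    if task == "recognition":
        return recognition_head(features)
    if task == "tracking":
        return tracking_head(features)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    backbone = SharedBackbone()
    frames = [[1.0, 1.0], [0.0, 1.0]]
    print(run(backbone, frames, "follow the person", "tracking"))
    print(run(backbone, frames, "is the person visible?", "recognition"))
```

The point of the sketch is only the routing: both tasks consume the same encoded features, which is what lets recognition and tracking be trained jointly.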

Paper Structure

This paper contains 44 sections, 3 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: TrackVLA is a vision-language-action model capable of simultaneous object recognition and visual tracking, trained on a dataset of 1.7 million samples. It demonstrates robust tracking, long-horizon tracking, and cross-domain generalization across diverse challenging environments.
  • Figure 2: Overall pipeline of TrackVLA. Given a video and a language instruction, TrackVLA outputs either a tracking trajectory for the robot or an answer to the recognition question.
  • Figure 3: Anchor-based Diffusion Action Model.
  • Figure 4: Overview of the training datasets used in TrackVLA. We collect 855K embodied visual tracking samples and 855K open-world recognition samples to jointly enhance the robust recognition and tracking capabilities of TrackVLA.
  • Figure 5: Real-world qualitative results of TrackVLA. TrackVLA is deployed in a zero-shot manner across diverse environments, executing diverse tracking instructions in challenging scenarios.
  • ...and 8 more figures