Hierarchical Instruction-aware Embodied Visual Tracking
Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong
TL;DR
The paper tackles User-Centric Embodied Visual Tracking (UC-EVT), where translating high-level user instructions into reliable, real-time tracking actions is difficult because large models are limited in either inference speed or generalization. It proposes Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT), which uses an LLM-based Semantic-Spatial Goal Aligner to convert instructions into spatial goals and an RL-based Adaptive Goal-Aligned Policy to realize those goals, with an offline goal-conditioned training regime and retrieval-augmented correction. A 10-million-step trajectory dataset, evaluation across 10 virtual environments, and three real-world deployments demonstrate strong generalization to unseen environments and robust performance under dynamic instruction changes, outperforming several baselines including GPT-4o and OpenVLA. The approach balances semantic understanding and real-time control by decoupling reasoning from action, offering a scalable, instruction-responsive framework for embodied spatial intelligence, with data and code released publicly. $\mathcal{D}(\mathcal{I}_t, \mathcal{S}_t)$ is minimized over episodes, and spatial goals serve as intermediaries between language and motion, enabling practical, user-centric tracking in diverse settings.
Abstract
User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the \textbf{Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT)} agent, which bridges instruction comprehension and action generation using \textit{spatial goals} as intermediaries. HIEVT first introduces an \textit{LLM-based Semantic-Spatial Goal Aligner} to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then, the \textit{RL-based Adaptive Goal-Aligned Policy}, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.
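To make the decoupling of instruction comprehension from action generation concrete, the following is a minimal Python sketch of a two-level loop, not the authors' implementation. Everything in it is an assumption introduced for illustration: the `SpatialGoal` encoding (normalized image center and size), the keyword rules in `align_instruction` that stand in for the LLM-based aligner, and the proportional controller in `goal_policy` that stands in for the learned RL policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialGoal:
    """Hypothetical goal encoding: where the target should appear in the image."""
    cx: float    # horizontal centre, 0.0 (left edge) to 1.0 (right edge)
    cy: float    # vertical centre, 0.0 (top edge) to 1.0 (bottom edge)
    size: float  # fraction of the image the target should cover


def align_instruction(instruction: str, current: SpatialGoal) -> SpatialGoal:
    """Stand-in for the slow, high-level aligner (keyword rules, not an LLM)."""
    text = instruction.lower()
    cx, size = current.cx, current.size
    if "left" in text:
        cx = 0.25
    elif "right" in text:
        cx = 0.75
    elif "center" in text:
        cx = 0.5
    if "closer" in text:
        size = min(size * 2.0, 0.5)
    elif "farther" in text or "further" in text:
        size = max(size / 2.0, 0.02)
    return SpatialGoal(cx, current.cy, size)


def goal_policy(observed: SpatialGoal, goal: SpatialGoal, gain: float = 1.0):
    """Stand-in for the fast, low-level policy (proportional control, not RL).

    Returns (forward_speed, turn_rate); positive turn_rate is assumed to mean
    turning right, which shifts the target leftwards in the image.
    """
    forward = gain * (goal.size - observed.size)  # target too small: approach
    turn = gain * (observed.cx - goal.cx)         # target right of goal: turn right
    return forward, turn


# The aligner runs only when the user issues a new instruction ...
goal = SpatialGoal(cx=0.5, cy=0.5, size=0.1)
goal = align_instruction("Keep the target on the left and get closer", goal)

# ... while the policy runs at every control step against the latest goal.
observed = SpatialGoal(cx=0.5, cy=0.5, size=0.1)
forward, turn = goal_policy(observed, goal)
```

In this sketch the expensive language step is invoked once per instruction change, while the cheap control step can run at frame rate, which is the division of labor that the spatial goal makes possible.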
