Hierarchical Instruction-aware Embodied Visual Tracking

Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong

TL;DR

The paper tackles User-Centric Embodied Visual Tracking (UC-EVT), where translating high-level user instructions into reliable, real-time tracking actions is difficult because large models are limited in either inference speed or generalization. It proposes Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT), which uses an LLM-based Semantic-Spatial Goal Aligner to convert instructions into spatial goals and an RL-based Adaptive Goal-Aligned Policy to realize those goals, with offline goal-conditioned training and retrieval-augmented goal correction. A 10-million-step trajectory dataset, evaluation across 10 virtual environments, and three real-world deployments demonstrate strong generalization to unseen environments and robust performance under dynamic instruction changes, outperforming several baselines including GPT-4o and OpenVLA. The approach balances semantic understanding and real-time control by decoupling reasoning from action, offering a scalable, instruction-responsive framework for embodied spatial intelligence, with data and code publicly released. The objective $\mathcal{D}(\mathcal{I}_t, \mathcal{S}_t)$ is minimized over episodes, and spatial goals serve as intermediaries between language and motion, enabling practical, user-centric tracking in diverse settings.

Abstract

User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.

Paper Structure

This paper contains 52 sections, 5 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Examples of User-Centric embodied visual tracking with diverse instructions.
  • Figure 2: Overview of the Hierarchical Instruction-aware Embodied Visual Tracker (HIEVT). Given a natural language instruction and an environmental observation, our system first processes the instruction through the LLM-based Semantic-Spatial Goal Aligner, which comprises Semantic Parsing, Spatial-Goal Generation, and Retrieval-Augmented Goal Correction. This produces a target attribute and a spatial goal in bounding-box format. The RL-based Adaptive Goal-Aligned Policy then combines this goal with the observation processed by the Visual Foundation Model (VFM) and feeds both into the policy network. The Goal State Aligner and Recurrent Policy then generate appropriate action signals to maintain the desired spatial relationship with the target.
  • Figure 3: Examples of the virtual and real-world environments used in our experiments. The FlexibleRoom environment is used for training data collection and features diverse augmentable factors. The nine photo-realistic environments in the middle are used for quantitative evaluation. We also deploy our proposed method in three real-world scenarios to validate its effectiveness and transferability.
  • Figure 4: Pixel-level distance between target center and goal center across time steps. The spike at step #101 represents a goal shift instruction, followed by rapid corrections (steps #107, #112) as our agent adjusts to the new spatial goal.
  • Figure 5: We deploy the agent on a wheeled robot in a real-world scenario. The sequence shows the robot responding to the user's textual instruction with real-time adjustments in tracking and positioning. The visual annotation (white mask) is generated by SAM-Track with the text prompt "person"; the red bounding box is generated by the LLM parser based on the instructions and observation.
  • ...and 8 more figures
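
The quantity plotted in Figure 4, the pixel-level distance between the target center and the goal center, can be illustrated with a minimal sketch. The caption does not specify the box layout or the distance metric, so the `(x_min, y_min, x_max, y_max)` format, the Euclidean metric, and the function names `box_center` and `center_distance` below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def box_center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max).

    The box layout is an assumption; the source only says the spatial goal
    is in bounding-box format.
    """
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_distance(target_box, goal_box):
    """Euclidean distance in pixels between the two box centers.

    Euclidean distance is assumed here; the caption says only
    "pixel-level distance".
    """
    tx, ty = box_center(target_box)
    gx, gy = box_center(goal_box)
    return math.hypot(tx - gx, ty - gy)


if __name__ == "__main__":
    # Hypothetical boxes: the target is offset from the goal by (3, 4) pixels.
    target = (0, 0, 10, 10)
    goal = (3, 4, 13, 14)
    print(center_distance(target, goal))
```

Under these assumptions, a goal-shift instruction replaces `goal_box` with a new box, which would raise this distance abruptly until the tracker repositions the target, matching the spike-then-correction pattern the caption describes.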