VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models

Kui Wu, Shuhang Xu, Hao Chen, Churan Wang, Zhoujun Li, Yizhou Wang, Fangwei Zhong

TL;DR

This work tackles prolonged target loss in Embodied Visual Tracking (EVT) with a self-improving framework that couples a fast tracking policy with Vision-Language Model (VLM) reasoning activated only on failure. A memory-augmented reflection mechanism lets the VLM learn from past failures, progressively improving its 3D spatial reasoning and refining recovery actions through retrieved exemplars. In challenging environments, the framework raises success rates by $72\%$ when paired with state-of-the-art RL-based trackers and by $220\%$ when paired with PID-based trackers. The authors present it as the first integration of VLM-based failure recovery for EVT. The results suggest practical value for real-world robotics that requires continuous target monitoring in dynamic, unstructured settings, and point to future work on faster reasoning and on extending the approach to navigation and human–robot interaction tasks.

Abstract

We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines off-the-shelf active tracking methods with the reasoning capabilities of VLMs, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs' limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements: in challenging environments, our framework boosts success rates by $72\%$ when combined with state-of-the-art RL-based approaches and by $220\%$ when combined with PID-based methods. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: https://sites.google.com/view/evt-recovery-assistant.
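The division of labor described in the abstract (a fast policy during normal tracking, VLM reasoning only after a failure is detected) can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the class `HybridTracker`, the `target_visible` observation key, the `lost_threshold` parameter, and the action strings are all hypothetical names chosen for illustration, and the failure detector is reduced to a simple count of consecutive frames without the target.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HybridTracker:
    """Illustrative switch between a fast policy and a slow VLM planner."""

    # Hypothetical callables: observation -> single action / action plan.
    fast_policy: Callable[[Dict], str]
    vlm_recovery: Callable[[Dict], List[str]]
    # Assumed failure criterion: target missing for this many frames in a row.
    lost_threshold: int = 3
    lost_frames: int = 0
    pending: List[str] = field(default_factory=list)

    def step(self, obs: Dict) -> str:
        if obs["target_visible"]:
            # Normal tracking: reset failure state, use the fast policy.
            self.lost_frames = 0
            self.pending.clear()
            return self.fast_policy(obs)

        self.lost_frames += 1
        if self.lost_frames < self.lost_threshold:
            # Brief loss: keep using the fast policy, no VLM call yet.
            return self.fast_policy(obs)

        if not self.pending:
            # Failure declared: query the (slow) VLM once for a recovery plan.
            self.pending = list(self.vlm_recovery(obs))
        if not self.pending:
            # The planner returned nothing; fall back to the fast policy.
            return self.fast_policy(obs)
        # Execute the recovery plan one action per step.
        return self.pending.pop(0)
```

Under these assumptions the expensive VLM call happens only once per recovery plan, and control returns to the fast policy as soon as the target is visible again.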

Paper Structure

This paper contains 15 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of tracking capabilities between a traditional tracking method and our proposed VLM-enhanced Embodied Visual Tracking approach. When the target is occluded by obstacles (boxes and pillars), the traditional method may lose track and fail (Red), while our VLM-enhanced approach analyzes the target's trajectory and the surrounding environment, then actively attempts to recover the target by reasoning about its possible location behind the pillar (Green).
  • Figure 2: The framework for integrating Vision-Language Models (VLMs) with active tracking policies. The framework follows a structured recovery approach when target tracking fails, consisting of five main steps: (1) Failure Detection, which uses segmentation-based target identification; (2) Failure Case Analysis, which applies chain-of-thought reasoning to understand the environmental context; (3) Movement Suggestions, which provide structured action plans with directional, environmental, and conditional triggers; (4) Memory-Augmented Self-Reflection, which plans recovery sequences and optimizes actions based on stored experiences; and (5) Reflection Insights Generation, which summarizes the entire recovery plan and suggests an adjustment when the plan fails. The framework enables agents to recover from occlusions and extended target loss by leveraging the reasoning capabilities of VLMs (e.g., GPT-4o), memory management for historical context, self-reflection for continuous improvement, and a robust visual tracking policy for sustained tracking once recovery is achieved.
  • Figure 3: Four high-fidelity virtual environments used for testing the embodied visual tracking agents. The four environments are built on Unreal Engine 5 and UnrealCV (Qiu et al., 2017) to simulate real-world challenges.
  • Figure 4: Visualization of a successful recovery sequence in Old Factory. Red frames show failure events. Blue frames in the middle show the recovery actions (orange arrows). Green frames show successful target reacquisition. The bottom section presents the system's reasoning process, including failure case analysis, movement suggestions, memory-augmented self-reflection, and reflection insights generation that compares expected versus actual behavior.
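The memory-augmented self-reflection step in Figure 2 (store past recovery attempts, then retrieve similar ones as exemplars for the next plan) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's mechanism: the `Experience` fields, the `ReflectionMemory` class, the `build_prompt` helper, and the word-overlap (Jaccard) similarity are all hypothetical choices, since the page does not specify how experiences are encoded or matched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experience:
    """One stored recovery attempt (hypothetical schema)."""
    situation: str       # text description of the failure case
    actions: List[str]   # recovery actions that were executed
    succeeded: bool      # whether the target was reacquired
    insight: str         # reflection written after the attempt


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class ReflectionMemory:
    entries: List[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)

    def retrieve(self, situation: str, k: int = 2) -> List[Experience]:
        """Return the k stored experiences most similar to the situation."""
        ranked = sorted(
            self.entries,
            key=lambda e: similarity(situation, e.situation),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(situation: str, exemplars: List[Experience]) -> str:
    """Assemble a text prompt that shows retrieved exemplars to the VLM."""
    lines = [f"Current failure: {situation}", "Past experiences:"]
    for e in exemplars:
        outcome = "success" if e.succeeded else "failure"
        lines.append(
            f"- {e.situation} | actions: {', '.join(e.actions)} "
            f"| outcome: {outcome} | insight: {e.insight}"
        )
    lines.append("Propose a recovery action sequence.")
    return "\n".join(lines)
```

A real system would more likely compare visual or learned embeddings than raw words; the word-overlap score is used here only to keep the sketch dependency-free.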