Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Fangwei Zhong, Kui Wu, Hai Ci, Churan Wang, Hao Chen

TL;DR

This work tackles embodied visual tracking (EVT) by combining visual foundation models (VFMs) with offline reinforcement learning, so that a robust policy can be learned quickly from offline data alone. By representing state with text-conditioned segmentation masks, applying a re-targeting mechanism, and training a recurrent policy via CQL-SAC, the approach achieves high sample efficiency and strong generalization to unseen targets and environments. Extensive virtual-environment experiments demonstrate fast training (about 1 hour on a consumer GPU), robustness to distractors, and effective sim-to-real transfer to a mobile robot. The framework sets a practical benchmark for EVT, highlighting the value of VFMs and offline RL in embodied vision tasks while outlining avenues for further real-world deployment and reduced data dependence.

Abstract

Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. It is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., Nvidia RTX 3090. We evaluate our agent on several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
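
To make the offline RL step more concrete, the following is a minimal sketch of the conservative regularizer that gives Conservative Q-Learning its name. It is not the authors' implementation: the paper trains a recurrent policy with CQL-SAC, whereas this sketch assumes a small discrete action set and plain NumPy arrays to stay short. The function names `cql_penalty` and `cql_loss` and the weight `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np


def logsumexp(q, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(q, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(q - m), axis=axis))


def cql_penalty(q_values, dataset_actions):
    """Conservative term: push down Q over all actions, push up Q on dataset actions.

    q_values:        (batch, n_actions) Q estimates for each discrete action.
    dataset_actions: (batch,) integer actions actually stored in the offline data.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    dataset_actions = np.asarray(dataset_actions, dtype=np.int64)
    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, dataset_actions]
    return float(np.mean(logsumexp(q_values, axis=1) - q_data))


def cql_loss(q_values, dataset_actions, td_targets, alpha=1.0):
    """Squared TD error on dataset actions plus the weighted conservative term.

    alpha is an assumed trade-off weight, not a value taken from the paper.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    dataset_actions = np.asarray(dataset_actions, dtype=np.int64)
    td_targets = np.asarray(td_targets, dtype=np.float64)
    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, dataset_actions]
    td_error = float(np.mean((q_data - td_targets) ** 2))
    return td_error + alpha * cql_penalty(q_values, dataset_actions)
```

Because the log-sum-exp over all actions is never smaller than the Q-value of any single action, the penalty is non-negative and shrinks only when the dataset actions dominate. This is what discourages the policy from over-valuing actions that never appear in the demonstrations.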

Paper Structure (33 sections, 1 equation, 12 figures, 9 tables)

Figures (12)

  • Figure 1: The overall framework of our proposed method. For data acquisition, a state-based policy collects diverse image-action-reward trajectories by interacting with various complex environments. We augment the policy by adding different levels of perturbation to its actions. The observed color images are encoded into text-conditioned segmentation masks, which highlight the target object (white), show obstacles (colored), and remove background noise (black). We then use offline reinforcement learning, such as conservative Q-learning (CQL), to train the recurrent policy network, which outputs actions based on the segmentation masks.
  • Figure 2: Examples of an observed RGB image, the corresponding instance segmentation mask (ISM) from UnrealCV [qiu2017unrealcv], and text-conditioned segmentation masks from DEVA [cheng2023tracking]. The middle one is the original mask from DEVA. The two masks on the right use re-targeting to emphasize the target under different text prompts.
  • Figure 3: Left: Examples of visual observations and the segmentation masks provided by the vision foundation model in different testing environments. The three snapshots at Urban Road, Urban City, and Snow Village show that the re-targeting mechanism distinguishes the target, obstacles, and distractors with different colors in the mask. The sequence at Parking Lot shows that the VFM consistently identifies the target even when a distractor fully occludes it. Right: Learning curves of different state representations for offline RL, validated in Complex Room, showing the mean and standard deviation over training runs with 3 seeds.
  • Figure 4: The output actions of our agent in two video sequences from the VOT Challenge [VOT_TPAMI]. We use "Deer" and "Person" as the respective prompts. The central red point marks the image center. The width of the bottom green rectangle indicates angular velocity control from -30°/s to 30°/s, while the height of the bottom blue rectangle indicates linear velocity control from -1 m/s to 1 m/s.
  • Figure 5: The agent is deployed on a real robot and follows a human through a complex corridor, navigating corners and hallways while encountering pedestrians and target occlusion.
  • ...and 7 more figures
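
The color convention described in the captions of Figures 1 to 3 (target in white, other objects in color, background in black) can be sketched as a simple recoloring step. This is an illustration under stated assumptions, not the paper's re-targeting code: it assumes the segmentation model has already produced an integer label map in which 0 is background and each positive id is one object, and that the id of the prompted target is known. The names `retarget_mask` and `PALETTE` are hypothetical.

```python
import numpy as np

# Hypothetical palette for non-target objects; none of the entries is white.
PALETTE = np.array(
    [[230, 25, 75], [60, 180, 75], [0, 130, 200], [245, 130, 48]],
    dtype=np.uint8,
)


def retarget_mask(label_map, target_id):
    """Recolor an integer label map into an RGB mask.

    label_map: (H, W) integers, 0 = background, positive ids = segmented objects.
    target_id: id of the object selected as the tracking target (assumed known).
    Returns an (H, W, 3) uint8 image: target white, other objects colored,
    background black.
    """
    label_map = np.asarray(label_map)
    out = np.zeros(label_map.shape + (3,), dtype=np.uint8)  # black background
    for obj_id in np.unique(label_map):
        if obj_id == 0:
            continue  # leave background pixels black
        if obj_id == target_id:
            color = (255, 255, 255)  # emphasize the target
        else:
            color = PALETTE[int(obj_id) % len(PALETTE)]
        out[label_map == obj_id] = color
    return out
```

For example, calling `retarget_mask(labels, target_id=2)` on a label map containing objects 1 and 2 paints object 2 white and object 1 with a palette color. A policy that only ever sees such masks does not depend on the target's appearance, which is one plausible reading of why the representation generalizes to unseen targets.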