Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL
Fangwei Zhong, Kui Wu, Hai Ci, Churan Wang, Hao Chen
TL;DR
This work tackles embodied visual tracking (EVT) by combining visual foundation models (VFMs) with offline reinforcement learning (offline RL), so that a robust policy can be learned quickly from pre-collected data. The state is represented with text-conditioned segmentation masks, a re-targeting mechanism is applied to those masks, and a recurrent policy is trained with CQL-SAC; together these yield high sample efficiency and strong generalization to unseen targets and environments. Experiments in virtual environments show fast training (about 1 hour on a consumer GPU), robustness to distractors, and effective sim-to-real transfer to a mobile robot. The framework offers a practical benchmark for EVT and highlights the value of VFMs and offline RL in embodied vision tasks, while outlining avenues for further real-world deployment and reduced data dependence.
Abstract
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. It is a vital and challenging skill for embodied agents, yet existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks from text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, so that it learns from collected demonstrations without online interaction. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., an NVIDIA RTX 3090. We evaluate our agent in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate that the learned agent transfers from virtual environments to a real-world robot.
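
To make the "conservative" part of Conservative Q-Learning concrete, the following is a minimal sketch of the standard CQL regularizer, written for a discrete action set purely for illustration. It is not the implementation used in this work: the function names `logsumexp` and `cql_penalty`, the discrete-action simplification, and the toy Q-values are all assumptions introduced here. The penalty pushes down Q-values over all actions (via a log-sum-exp) while pushing up the Q-value of the action actually recorded in the offline dataset, which discourages the policy from overestimating actions it never saw demonstrated.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_batch, data_actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_data) over a batch.

    q_batch:      list of per-state Q-value lists, one entry per action
                  (hypothetical discrete action set, for illustration only)
    data_actions: index of the action stored in the offline dataset
                  for each state
    """
    terms = [
        logsumexp(q_row) - q_row[action]
        for q_row, action in zip(q_batch, data_actions)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy Q-values for two states and three actions (made-up numbers).
    q_batch = [[1.0, 0.5, -0.2], [0.1, 2.0, 0.3]]
    data_actions = [0, 1]
    print(cql_penalty(q_batch, data_actions))
```

In practice this term is added, with a weighting coefficient, to the usual temporal-difference loss of the critic. The penalty is never negative, and it approaches zero only when the dataset action dominates the Q-values of all other actions.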
