VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang
TL;DR
The paper tackles the perceptual inefficiency of multi-camera setups in 3D robotic manipulation by introducing VERM, a GPT-4o-guided virtual eye that imagines a task-adaptive view from a 3D point cloud. It combines a depth-aware action model with a dynamic coarse-to-fine refinement procedure to enable efficient 3D manipulation from a single virtual image. Across RLBench and real-world tests, VERM speeds up training by 1.89x and inference by 1.54x while maintaining high task success, and it generalizes across multiple foundation models. The approach reduces input complexity without sacrificing performance and remains robust with limited supervision. It also outlines promising directions for dynamic view updates and broader task applicability.
Abstract
When performing 3D manipulation tasks, robots have to plan actions based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), a method that leverages the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud. This view efficiently captures the necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on both the RLBench simulation benchmark and real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training and a 1.54x speedup in inference. More results can be found on our project website at https://verm-ral.github.io.
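To make the idea of a virtual view concrete, the following is a minimal sketch of rendering a single image from a colored point cloud with a pinhole camera placed on a sphere around the workspace. It is an illustration under stated assumptions, not the paper's implementation: the azimuth/elevation parameterization, the function names `look_at_rotation` and `render_virtual_view`, and the default radius, image size, and focal length are all hypothetical. In VERM the camera pose would come from the foundation model; here it is simply passed in.

```python
import numpy as np

def look_at_rotation(eye, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation with rows (right, down, forward).

    Degenerate if the viewing direction is parallel to `up`
    (for example, an elevation of exactly 90 degrees).
    """
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    return np.stack([right, down, forward])

def render_virtual_view(points, colors, azimuth_deg, elevation_deg,
                        radius=1.5, target=(0.0, 0.0, 0.0),
                        size=64, focal=64.0):
    """Project a colored point cloud into one virtual pinhole camera.

    All parameters are illustrative assumptions. Returns an RGB image of
    shape (size, size, 3) and a depth map with inf at empty pixels.
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=float)
    target = np.asarray(target, dtype=float)

    # Place the camera on a sphere around the target.
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    direction = np.array([np.cos(el) * np.cos(az),
                          np.cos(el) * np.sin(az),
                          np.sin(el)])
    eye = target + radius * direction

    # Transform points into the camera frame and drop those behind it.
    cam = (points - eye) @ look_at_rotation(eye, target).T
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]

    # Pinhole projection to integer pixel coordinates.
    z = cam[:, 2]
    u = np.rint(focal * cam[:, 0] / z + size / 2).astype(int)
    v = np.rint(focal * cam[:, 1] / z + size / 2).astype(int)
    inside = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: keep the nearest depth per pixel, then color those points.
    depth = np.full((size, size), np.inf)
    np.minimum.at(depth, (v, u), z)
    image = np.zeros((size, size, 3))
    nearest = z == depth[v, u]
    image[v[nearest], u[nearest]] = colors[nearest]
    return image, depth
```

Returning the depth map alongside the RGB image keeps the 3D information that a single 2D view would otherwise lose, which is the kind of signal a depth-aware action model could consume. How VERM actually encodes and uses depth is not specified in this summary.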
