Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation
Yuelei Li, Ge Yan, Annabella Macaluso, Mazeyu Ji, Xueyan Zou, Xiaolong Wang
TL;DR
The paper addresses autonomous long-horizon robotic manipulation in dynamic real-world environments, where purely language-based planners struggle to ground their plans in visual observations. It proposes LMM-3DP, a framework that couples an LMM planner driven by visual feedback with a language-conditioned 3D skill policy, which fuses 2D semantic features and 3D geometry through a 3D transformer. Planning is augmented with a critic, a memory, and a growing skill library for continual improvement. Key contributions include a closed-loop planning-execution architecture, real-world kitchen experiments showing roughly 1.5x higher planning accuracy and 1.45x higher low-level success than LLM-based baselines, and ablations showing the importance of visual feedback and the critic. The approach improves autonomy and robustness in open-world manipulation and offers a scalable way to integrate high-level reasoning with grounded 3D action policies. A project page with demos is available.
Abstract
Recent advances in the visual reasoning capabilities of large multimodal models (LMMs) and in the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies that use 3D feature fields. In this work, we introduce LMM-3DP, a framework that integrates LMM planners and 3D skill policies. Our approach addresses three key aspects: high-level planning, low-level control, and effective integration of the two. For high-level planning, LMM-3DP supports dynamic scene understanding under environment disturbances, a critic agent with self-feedback, memorization of past policies, and reattempts after failures. For low-level control, LMM-3DP uses a semantic-aware 3D feature field for accurate manipulation. To align high-level planning with low-level control, language embeddings representing the high-level policy are jointly attended with the 3D feature field in the 3D transformer. We extensively evaluate our approach across multiple skills and long-horizon tasks in a real-world kitchen environment. Our results show a 1.45x increase in success rate for low-level control and an approximately 1.5x improvement in high-level planning accuracy compared to LLM-based baselines. Demo videos and an overview of LMM-3DP are available at https://lmm-3dp-release.github.io.
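The high-level loop described in the abstract (plan from the current scene, let a critic revise the plan, execute skills, record outcomes in memory, and reattempt after a failure) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Memory`, `run_closed_loop`, and `demo`, the callable signatures, and the stubbed planner, critic, and executor are all hypothetical. In the actual system the planner and critic would be LMM calls and the executor would be the 3D skill policy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical history of (skill, succeeded) pairs the planner can read."""
    history: List[Tuple[str, bool]] = field(default_factory=list)


def run_closed_loop(
    observe: Callable[[], str],
    plan: Callable[[str, Memory], List[str]],
    critic: Callable[[List[str], Memory], List[str]],
    execute: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[bool, Memory]:
    """Plan, critique, execute; replan from a fresh observation on failure."""
    memory = Memory()
    for _ in range(max_attempts):
        # Re-observe the scene each attempt so disturbances are reflected.
        steps = critic(plan(observe(), memory), memory)
        failed = False
        for skill in steps:
            ok = execute(skill)
            memory.history.append((skill, ok))
            if not ok:
                failed = True
                break  # stop this plan and reattempt with updated memory
        if not failed:
            return True, memory
    return False, memory


def demo() -> Tuple[bool, Memory]:
    """Stubbed example: the second skill execution fails once."""
    goal = ["open drawer", "pick cup", "place cup"]  # assumed skill names
    done: List[str] = []
    calls = {"n": 0}

    def observe() -> str:
        # Stand-in for visual feedback: report which skills are finished.
        return ",".join(done)

    def plan(obs: str, memory: Memory) -> List[str]:
        finished = set(obs.split(",")) if obs else set()
        return [s for s in goal if s not in finished]

    def critic(steps: List[str], memory: Memory) -> List[str]:
        return steps  # a real critic would revise the proposed plan

    def execute(skill: str) -> bool:
        calls["n"] += 1
        if calls["n"] == 2:
            return False  # simulated low-level failure
        done.append(skill)
        return True

    return run_closed_loop(observe, plan, critic, execute)
```

In this stub, the first attempt fails at "pick cup". The second attempt re-observes the scene, skips the already completed "open drawer", and finishes the remaining two skills.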
