RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation

Xuetao Li, Wenke Huang, Nengyuan Pan, Kaiyan Zhao, Songhua Yang, Yiming Wang, Mengde Li, Mang Ye, Jifeng Xuan, Miao Li

TL;DR

The paper tackles data-inefficient, geometry-blind multimodal manipulation in humanoid robotics. It proposes RGMP, an end-to-end framework that combines a Geometric-prior Skill Selector (GSS) for geometry-informed skill planning with an Adaptive Recursive Gaussian Network (ARGN) for data-efficient, multi-scale spatial reasoning and motion synthesis via a Gaussian mixture model (GMM). Key contributions include the plug-and-play GSS with geometric adapters, RoPE-based spatial memory and adaptive decay in ARGN, and a multimodal action representation that preserves distinct manipulation modes. Experiments on two robotic platforms report 87% task success in generalization tests and 5x greater data efficiency than the state-of-the-art baseline, indicating strong cross-domain generalization for real-world humanoid manipulation.

Abstract

Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise concerns due to the neglect of geometric reasoning in unseen scenarios and the inefficient modeling of robot-target relationships within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception capabilities, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision language model, producing adaptive skill sequences for unseen scenes with minimal spatial common sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and a desktop dual-arm robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5x greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, enabled by geometric-semantic reasoning and recursive-Gaussian adaptation.
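
To make the idea of a multimodal action representation concrete, here is a minimal sketch of sampling an action from a Gaussian mixture rather than regressing a single mean. It is not the authors' ARGN: the two-mode setup, the numbers, and the helper names `sample_gmm_action` and `mixture_mean` are assumptions chosen purely for illustration.

```python
import numpy as np

def sample_gmm_action(weights, means, covs, rng):
    """Draw one action: pick a mixture component, then sample from its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    return k, rng.multivariate_normal(means[k], covs[k])

def mixture_mean(weights, means):
    """Weighted average of the component means (what plain mean regression tends toward)."""
    return np.einsum("k,kd->d", weights, means)

# Hypothetical 2-D end-effector offsets: approach the object from the left or from the right.
weights = np.array([0.5, 0.5])
means = np.array([[-0.10, 0.0], [0.10, 0.0]])
covs = np.stack([np.eye(2) * 1e-4] * 2)  # tight spread (std 0.01) around each mode

rng = np.random.default_rng(0)
k, action = sample_gmm_action(weights, means, covs, rng)
avg = mixture_mean(weights, means)

print("sampled mode:", k, "action:", action)
print("averaged action:", avg)
```

In this toy setup the averaged action sits between the two modes, where neither demonstrated approach lies, while a sample from the mixture stays close to one of them. That is the sense in which a mixture can preserve distinct manipulation modes.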

Paper Structure

This paper contains 13 sections, 13 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our framework. By combining semantic cues from human instructions with common-sense information derived from visual perception, RGMP formulates the robot-target spatial relationships for tasks. RGMP achieves an 8% performance improvement and exhibits 5× greater data efficiency than Diffusion Policy.
  • Figure 2: Pipeline of RGMP. Upon receiving a speech command, the robot uses GSS to identify and localize the target object. By integrating object coordinates, shape cues (from the YOLOv8n-seg model $\phi(\cdot)$ [yaseen2024yolov9]), and geometric-prior knowledge, the robot selects an appropriate skill from the skill library, in which each skill is associated with a pretrained RGMP model. The selected RGMP model then executes the task precisely through adaptive recursive feature extraction and GMM-based refinement.
  • Figure 3: Structure of (a) Spatial Mixing Block and (b) Channel Mixing Block. The Spatial Mixing Block integrates an ADM for Dynamic Decay $\mathcal{W}$ and RoPE for directional awareness, enhancing spatial aggregation. The Channel Mixing Block reallocates channel-wise feature responses by integrating correlations between channels.
  • Figure 4: Pipeline of human-robot interactions. We validate models on the task of "passing me the tissue", with a training dataset comprising only 40 instances of tissue pinching actions. Our RGMP performs better than DP (Diffusion Policy).
  • Figure 5: Generalization ability of RGMP. We test RGMP on grasping various unseen objects at random positions. Despite being trained on only 40 demonstrations of grasping a Fanta can, RGMP reliably grasps the can from any position and generalizes this proficiency to unseen objects such as a Coke bottle, a spray can, and a human hand, demonstrating remarkable versatility.
  • ...and 1 more figures
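
The Figure 3 caption mentions RoPE (rotary position embedding) as the source of directional awareness in the Spatial Mixing Block. The sketch below shows the standard one-dimensional RoPE rotation and the property that makes it useful: the dot product between two rotated vectors depends only on their relative offset. It is an assumption that this textbook form matches the paper's variant; the function name `rope`, the `base` value, and the feature size are illustrative, and the adaptive decay described in the caption is not modeled.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to position pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (2) at two different absolute positions.
score_a = rope(q, 3) @ rope(k, 1)
score_b = rope(q, 5) @ rope(k, 3)
print(score_a, score_b)
```

Because each pair of features is rotated rather than shifted, vector norms are unchanged and the two scores agree up to floating-point error, so attention-style similarity encodes relative rather than absolute position.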