ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots

Zhixuan Xu, Chongkai Gao, Zixuan Liu, Gang Yang, Chenrui Tie, Haozhuo Zheng, Haoyu Zhou, Weikun Peng, Debang Wang, Tianrun Hu, Tianyi Chen, Zhouliang Yu, Lin Shao

TL;DR

ManiFoundation introduces a contact-synthesis-based foundation model for general-purpose robotic manipulation. It handles arbitrary objects and robots by predicting contact points and forces from object and hand point clouds, object properties, and target motions. It combines a feature extractor with a CVAE to generate multimodal contact proposals, followed by an optimization stage that refines the robot hand pose to satisfy wrench and collision constraints. A large-scale annotated dataset covering rigid, articulated rigid, and deformable objects supports training, and extensive simulation and real-world experiments show approximately 90% average success across categories, indicating broad generalization. The work enables integration with higher-level planners (LLMs/VLMs) for long-horizon tasks, bringing low-level manipulation closer to human-like versatility.

Abstract

To substantially enhance robot intelligence, there is a pressing need to develop a large model that enables general-purpose robots to proficiently undertake a broad spectrum of manipulation tasks, akin to the versatile task-planning ability exhibited by LLMs. The vast diversity in objects, robots, and manipulation tasks presents huge challenges. Our work introduces a comprehensive framework to develop a foundation model for general robotic manipulation that formalizes a manipulation task as contact synthesis. Specifically, our model takes as input object and robot manipulator point clouds, object physical attributes, target motions, and manipulation region masks. It outputs contact points on the object and associated contact forces or post-contact motions for robots to achieve the desired manipulation task. We perform extensive experiments in both simulation and real-world settings, manipulating articulated rigid objects, rigid objects, and deformable objects that vary in dimensionality, ranging from one-dimensional objects like ropes to two-dimensional objects like cloth and extending to three-dimensional objects such as plasticine. Our model achieves average success rates of around 90%. Supplementary materials and videos are available on our project website at https://manifoundationmodel.github.io/.
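
To make the idea of contact points paired with contact forces concrete, the following is a minimal sketch of how a set of contacts determines the net force and torque (the wrench) on a rigid object. It is a generic rigid-body calculation, not the authors' implementation; the function names `cross` and `net_wrench`, the tuple-based vector type, and the choice of reference point are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def net_wrench(
    contact_points: Sequence[Vec3],
    contact_forces: Sequence[Vec3],
    reference: Vec3 = (0.0, 0.0, 0.0),
) -> Tuple[Vec3, Vec3]:
    """Sum the forces, and the torques about `reference`, over all contacts.

    Hypothetical helper: each contact i applies force f_i at point p_i,
    contributing torque (p_i - reference) x f_i.
    """
    if len(contact_points) != len(contact_forces):
        raise ValueError("need exactly one force per contact point")

    force = [0.0, 0.0, 0.0]
    torque = [0.0, 0.0, 0.0]
    for p, f in zip(contact_points, contact_forces):
        # Lever arm from the reference point to the contact point.
        r = (p[0] - reference[0], p[1] - reference[1], p[2] - reference[2])
        tau = cross(r, f)
        for i in range(3):
            force[i] += f[i]
            torque[i] += tau[i]
    return (force[0], force[1], force[2]), (torque[0], torque[1], torque[2])


# Two opposing forces applied at opposite sides of the object form a pure
# couple: no net force, but a net torque about the z-axis.
points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
forces = [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
total_force, total_torque = net_wrench(points, forces)
```

In this example the two contacts produce rotation without translation, which illustrates why both the location and the direction of each contact force matter when a target motion is specified.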

Paper Structure

This paper contains 38 sections, 7 equations, 11 figures, 4 tables, and 1 algorithm.

Figures (11)

  • Figure 1: We propose a ManiFoundation Model that can generalize over a diverse range of robots and objects, and perform various kinds of manipulation tasks based on 3D point cloud input.
  • Figure 2: The pipeline of our ManiFoundation model. Left: we decompose a manipulation task to a sequence of object point cloud motions from either VLM-based planning or a flow model. Middle: we train a ManiFoundation network to predict the contact point and force heatmap for each motion of the sequence. Right: we acquire the robot pose for execution based on optimization with the initial results from the contact point and force heatmaps.
  • Figure 3: Overview of our ManiFoundation network. The feature extractor module incorporates the information from both object and robot point clouds, and the CVAE module generates the contact point and force maps on the given object.
  • Figure 4: Visualizations of evaluations on deformable objects are shown. The point cloud colors represent the contact heatmap, with purple indicating low values and yellow indicating high values. Arrows denote the force or motion direction. This representation is consistent across all figures.
  • Figure 5: Visualizations of how physical properties affect network outputs. Blue squares and cylinders are objects. Overlapping images with transparency represent the moved object. The orange arrow represents the motion.
  • ...and 6 more figures