GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation

Weiliang Tang; Jia-Hui Pan; Yun-Hui Liu; Masayoshi Tomizuka; Li Erran Li; Chi-Wing Fu; Mingyu Ding

TL;DR

GeoManip is a training-free framework that translates natural-language task descriptions into precise robot manipulation via geometric constraints. It combines a geometry parser, a constraint generator, and a cost-function-based trajectory solver with a GPT-4o-driven reasoning loop that produces constraint sets and corresponding cost functions $f: \mathcal{P} \to \mathbb{R}^+$. The solver then optimizes over $R \in SE(3)$ and $\mathbf{t} \in \mathbb{R}^3$ to solve $\min_{R,\mathbf{t}} \frac{1}{K^s} \sum_{f \in \mathcal{F}^s} f(\ldots) + \alpha \|\mathbf{t}-\mathbf{t}_0\|_2 + \beta \|\mathrm{euler}(R R_0^{-1})\|_1$. The approach generalizes to unseen tasks and objects in both simulation and real-world settings without additional training, outperforming training-based baselines and a geometry-only method. The framework supports five interaction capabilities: on-the-fly adaptation, learning from failures, learning from demonstrations, long-horizon planning, and efficient data collection for imitation learning. Overall, GeoManip offers an interpretable interface that bridges language, geometry, and action for generalist robots.
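To make the objective above concrete, here is a minimal sketch that only evaluates it for one candidate pose; it is not the authors' solver. Several things are assumptions made for illustration: $K^s$ is read as the number of cost functions in $\mathcal{F}^s$, each cost function takes the transformed points of the manipulated part (the summary elides its arguments), $\mathrm{euler}(\cdot)$ uses a ZYX convention, and the names `objective`, `euler_zyx`, `alpha`, and `beta` defaults are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix (equals the inverse for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def euler_zyx(m):
    """Roll, pitch, yaw for m = Rz(yaw) Ry(pitch) Rx(roll).

    Assumed convention; valid away from gimbal lock (|pitch| < pi/2).
    """
    yaw = math.atan2(m[1][0], m[0][0])
    pitch = math.asin(max(-1.0, min(1.0, -m[2][0])))
    roll = math.atan2(m[2][1], m[2][2])
    return [roll, pitch, yaw]

def objective(R, t, R0, t0, cost_fns, points, alpha=0.1, beta=0.1):
    """Mean constraint cost plus penalties for moving away from (R0, t0)."""
    # Apply the candidate pose to the part's points (assumed cost input).
    moved = [[sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
             for p in points]
    constraint_term = sum(f(moved) for f in cost_fns) / len(cost_fns)
    # L2 norm of the translation change.
    translation_term = math.sqrt(sum((a - b) ** 2 for a, b in zip(t, t0)))
    # L1 norm of the Euler angles of the relative rotation R * R0^{-1}.
    rotation_term = sum(abs(a) for a in euler_zyx(matmul(R, transpose(R0))))
    return constraint_term + alpha * translation_term + beta * rotation_term

# Toy usage: one hypothetical constraint asking the first point to sit at z = 0.5.
identity = rot_z(0.0)
reach_height = lambda pts: abs(pts[0][2] - 0.5)
stay = objective(identity, [0.0, 0.0, 0.0], identity, [0.0, 0.0, 0.0],
                 [reach_height], [[0.0, 0.0, 0.0]])
move = objective(identity, [0.0, 0.0, 0.5], identity, [0.0, 0.0, 0.0],
                 [reach_height], [[0.0, 0.0, 0.0]])
```

In this toy case, moving to the target height satisfies the constraint at the price of a small translation penalty, so `move` is lower than `stay`. A real solver would search over poses with a numerical optimizer rather than compare two hand-picked candidates.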

Abstract

We present GeoManip, a framework that enables generalist robots to leverage essential conditions derived from object and part relationships, expressed as geometric constraints, for robot manipulation. For example, cutting a carrot requires adhering to a geometric constraint: the blade of the knife should be perpendicular to the carrot's direction. By interpreting these constraints through symbolic language representations and translating them into low-level actions, GeoManip bridges the gap between natural language and robotic execution, enabling greater generalizability across diverse, even unseen, tasks, objects, and scenarios. Unlike vision-language-action models that require extensive training, GeoManip operates training-free by utilizing large foundational models: a constraint generation module that predicts stage-specific geometric constraints and a geometry parser that identifies the object parts involved in these constraints. A solver then optimizes trajectories to satisfy the constraints inferred from the task description and the scene. Furthermore, GeoManip learns in-context and provides five appealing human-robot interaction features: on-the-fly policy adaptation, learning from human demonstrations, learning from failure cases, long-horizon action planning, and efficient data collection for imitation learning. Extensive evaluations in both simulation and real-world scenarios demonstrate GeoManip's state-of-the-art performance, with superior out-of-distribution generalization while avoiding costly model training.
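The carrot-cutting example can be written as a small cost function. The sketch below is one plausible encoding, not the cost GeoManip actually generates: it assumes the blade direction and the carrot's direction have already been estimated as 3D vectors from the parsed geometry, and the names `unit` and `perpendicular_cost` are hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length; reject the zero vector."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / n for x in v]

def perpendicular_cost(dir_a, dir_b):
    """Absolute cosine of the angle between two directions.

    0 when the directions are perpendicular (constraint satisfied),
    1 when they are parallel or anti-parallel.
    """
    a, b = unit(dir_a), unit(dir_b)
    return abs(sum(x * y for x, y in zip(a, b)))

# Hypothetical directions: blade pointing down, carrot lying along x.
blade_direction = [0.0, 0.0, 1.0]
carrot_direction = [1.0, 0.0, 0.0]
cost = perpendicular_cost(blade_direction, carrot_direction)
```

Using the absolute cosine keeps the cost non-negative, which matches the $f: \mathcal{P} \to \mathbb{R}^+$ signature in the TL;DR, and makes it independent of the sign and length of either direction vector.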

Paper Structure (36 sections, 1 equation, 13 figures, 4 tables)

Introduction
Related Work
Methods
Task Decomposition and Process Control
Geometry Parser
Constraint Generator
Cost Functions and Trajectory Generation
Generalist Embodied Agent
Experiments
Implementation Details
Results on Virtual Benchmarks
Experiments on Real Environment
Generalist Embodied Agent for Robotic Manipulation
On-the-fly Policy Adaptation
Learn from Failure Cases
...and 21 more sections

Figures (13)

Figure 1: We propose to derive geometric constraints to bridge the gap between high-level language descriptions and low-level robot actions. As demonstrated in our results, our GeoManip is able to execute diverse tasks in general settings.
Figure 2: Given the user's task description, our method decomposes the task into multiple sub-tasks and forms the process control. For each stage, we first design a geometry parser to segment and obtain the point cloud for relative geometric components. Then, we develop a geometry constraint generation module to generate constraints among the geometric components that are necessary to complete the sub-task. Finally, we establish the cost functions to measure the fulfillment of the geometric constraints and solve the robotic trajectories via optimization.
Figure 3: Existing open-vocabulary image segmentation methods (LISA [lai2024lisa], OV-seg [liang2023open]) fail to segment the fine-grained geometric components, while our method segments them correctly.
Figure 4: Illustration of our select-process scheme in parsing the geometry.
Figure 5: Our embodied agent comprises five components: (i) a user input block that accepts the current observation of the scene, the language command from the user, and uploaded videos of robotic or human manipulation of the sub-task; (ii) a geometric constraint block that displays the generated geometric constraints for the sub-task and allows modifications; (iii) a cost function block that presents the cost function developed from the geometric constraints; (iv) a geometric component visualizer that shows the mask of the geometric component involved in the sub-task; (v) a trajectory visualizer that illustrates the planned trajectory in the scene.
...and 8 more figures
