GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation
Weiliang Tang, Jia-Hui Pan, Yun-Hui Liu, Masayoshi Tomizuka, Li Erran Li, Chi-Wing Fu, Mingyu Ding
TL;DR
GeoManip presents a training-free framework that translates natural language task descriptions into precise robot manipulation via geometric constraints. It introduces a geometry parser, a constraint generator, and a cost-function-based trajectory solver. A GPT-4o-driven reasoning loop yields constraint sets and corresponding cost functions $f: \mathcal{P} \to \mathbb{R}^+$, and the solver then optimizes over $R \in SE(3)$ and $\mathbf{t} \in \mathbb{R}^3$ by solving $\min_{R,\mathbf{t}} \frac{1}{K^s} \sum_{f \in \mathcal{F}^s} f(...) + \alpha \|\mathbf{t}-\mathbf{t}_0\|_2 + \beta \|\mathrm{euler}(R R_0^{-1})\|_1$. The approach demonstrates strong generalization to unseen tasks and objects in both simulation and real-world settings without additional training, outperforming training-based baselines and a geometry-only method. The framework enables five interaction capabilities (on-the-fly adaptation, learning from failures, learning from demonstrations, long-horizon planning, and efficient imitation-learning data collection), making robotic manipulation more generalizable and data-efficient. Overall, GeoManip offers a scalable, interpretable interface that bridges language, geometry, and action for generalist robots, with practical impact across diverse manipulation tasks.
Abstract
We present GeoManip, a framework that enables generalist robots to leverage essential conditions derived from object and part relationships, expressed as geometric constraints, for robot manipulation. For example, cutting a carrot requires adhering to a geometric constraint: the blade of the knife should be perpendicular to the carrot's direction. By interpreting these constraints through symbolic language representations and translating them into low-level actions, GeoManip bridges the gap between natural language and robotic execution, enabling greater generalizability across diverse, even unseen, tasks, objects, and scenarios. Unlike vision-language-action models that require extensive training, GeoManip operates training-free by utilizing large foundational models: a constraint generation module that predicts stage-specific geometric constraints and a geometry parser that identifies the object parts involved in these constraints. A solver then optimizes trajectories to satisfy the constraints inferred from the task description and the scene. Furthermore, GeoManip learns in-context and provides five appealing human-robot interaction features: on-the-fly policy adaptation, learning from human demonstrations, learning from failure cases, long-horizon action planning, and efficient data collection for imitation learning. Extensive evaluations in both simulation and real-world scenarios demonstrate GeoManip's state-of-the-art performance, with superior out-of-distribution generalization while avoiding costly model training.
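To make the carrot example concrete, the following is a minimal sketch of how a perpendicularity constraint could be written as a non-negative cost function and combined with penalties on deviation from an initial pose, in the spirit of the objective summarized in the TL;DR. It is not the authors' implementation. All names (`perpendicular_cost`, `objective`, `solve_yaw`) are hypothetical, the rotation is restricted to a single yaw angle about the z-axis so that the Euler-angle penalty reduces to one wrapped angle difference, and a coarse grid search stands in for the trajectory solver.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def perpendicular_cost(a, b):
    """Hypothetical constraint cost: 0 when a and b are perpendicular, 1 when parallel."""
    ua, ub = unit(a), unit(b)
    return abs(sum(x * y for x, y in zip(ua, ub)))

def rotate_z(v, yaw):
    """Rotate a 3D vector about the z-axis by yaw radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def objective(yaw, t, yaw0, t0, blade_dir, carrot_dir, alpha=0.1, beta=0.1):
    """Mean constraint cost plus penalties on moving away from the pose (yaw0, t0).

    Assumption: rotation is yaw-only, so the L1 norm of the Euler angles of the
    relative rotation is just the wrapped absolute yaw difference.
    """
    stage_costs = [perpendicular_cost(rotate_z(blade_dir, yaw), carrot_dir)]
    mean_cost = sum(stage_costs) / len(stage_costs)
    translation_penalty = math.dist(t, t0)  # L2 distance to the initial position
    d = yaw - yaw0
    rotation_penalty = abs(math.atan2(math.sin(d), math.cos(d)))
    return mean_cost + alpha * translation_penalty + beta * rotation_penalty

def solve_yaw(t0, yaw0, blade_dir, carrot_dir, steps=180):
    """Coarse grid search over yaw in [0, pi]; a stand-in for a real optimizer."""
    candidates = [math.pi * i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda yaw: objective(yaw, t0, yaw0, t0, blade_dir, carrot_dir),
    )
```

With the blade and the carrot both initially aligned with the x-axis, `solve_yaw([0.0, 0.0, 0.0], 0.0, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])` should select a yaw close to $\pi/2$: the constraint cost vanishes there, and the rotation penalty keeps the solution from drifting further from the initial pose than necessary.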
