Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation

Mingyang Sun, Jiude Wei, Qichen He, Donglin Wang, Cewu Lu, Jianhua Sun

TL;DR

GRACE tackles the semantic-to-physical gap in vision-language grounded robotics by introducing Executable Analytic Concepts (EAC), mathematically defined blueprints that encode object geometry, affordances, and manipulation constraints. The approach uses a policy scaffolding pipeline to convert natural language and perceptual input into 6-DoF grasp poses and force directions that feed a motion planner, enabling interpretable and physically grounded execution. In simulation and real-world experiments, GRACE demonstrates strong zero-shot generalization on articulated objects and shows that grounding VLM reasoning with analytic concepts substantially improves precision over prior end-to-end or primitive-grounding methods. The work offers a practical, interpretable bridge between high-level semantic reasoning and low-level control, with potential to integrate into existing Vision-Language-Action architectures.

Abstract

Enabling robots to perform precise and generalized manipulation in unstructured environments remains a fundamental challenge in embodied AI. While Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning, a significant gap persists between their high-level understanding and the precise physical execution required for real-world manipulation. To bridge this "semantic-to-physical" gap, we introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts (EAC), mathematically defined blueprints that encode object affordances, geometric constraints, and semantics of manipulation. Our approach integrates a structured policy scaffolding pipeline that turns natural language instructions and visual information into an instantiated EAC, from which we derive grasp poses and force directions and plan a physically feasible motion trajectory for robot execution. GRACE thus provides a unified and interpretable interface between high-level instruction understanding and low-level robot control, effectively enabling precise and generalizable manipulation through semantic-physical grounding. Extensive experiments demonstrate that GRACE achieves strong zero-shot generalization across a variety of articulated objects in both simulated and real-world environments, without requiring task-specific training.
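
To make the idea of an executable analytic concept more concrete, here is a minimal sketch of what such a blueprint might look like for a hinged door. It is not the paper's implementation: the class name `RevoluteDoorConcept`, its parameters, and both methods are hypothetical, chosen only to illustrate how a small set of geometric parameters can yield a grasp point and a force direction in closed form.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RevoluteDoorConcept:
    """Hypothetical blueprint: a door panel rotating about a vertical hinge axis."""
    hinge_xy: Tuple[float, float]  # hinge axis position in the ground plane (m)
    handle_radius: float           # distance from hinge axis to handle (m)
    handle_height: float           # handle height above the ground (m)
    joint_angle: float = 0.0       # current opening angle (rad)

    def grasp_point(self) -> Tuple[float, float, float]:
        """Handle position computed from the structural parameters."""
        hx, hy = self.hinge_xy
        x = hx + self.handle_radius * math.cos(self.joint_angle)
        y = hy + self.handle_radius * math.sin(self.joint_angle)
        return (x, y, self.handle_height)

    def force_direction(self, opening: bool = True) -> Tuple[float, float, float]:
        """Unit vector tangent to the handle's circular path around the hinge."""
        sign = 1.0 if opening else -1.0
        return (-sign * math.sin(self.joint_angle),
                sign * math.cos(self.joint_angle),
                0.0)


# Usage: instantiate the concept with estimated parameters (values are made up).
door = RevoluteDoorConcept(hinge_xy=(0.0, 0.0), handle_radius=0.5, handle_height=1.0)
grasp = door.grasp_point()
push = door.force_direction(opening=True)
```

Because the force direction is derived from the hinge geometry rather than predicted directly, it stays perpendicular to the hinge-to-handle line at every opening angle, which is the kind of physical consistency an analytic blueprint is meant to provide.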

Paper Structure

This paper contains 26 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example implementation of executable analytic concepts. (a) Geometric Concept Assets: each asset exposes its free parameters (top), canonical structure (middle), and partial affordance cues (bottom). (b) Structural Blueprint: higher-level objects are procedurally composed by wiring multiple geometric assets together, forming a parametric graph that captures their spatial layout and structural relationships. (c) Manipulation Blueprint: parameterised routines compute grasp poses and force directions that exploit the affordances encoded in the underlying structure.
  • Figure 2: An overview of the proposed method GRACE. (I) Task Parsing: A Vision–Language Model (VLM) parses the natural-language instruction based on the current RGB image. (II) Policy Scaffolding: The process includes: 1. segmenting the target object from images and back-projecting it to a partial point cloud; 2. parsing the analytic concept and estimating geometric parameters to instantiate the structural blueprint; 3. constructing the manipulation blueprint to produce feasible grasp poses and force directions; 4. generating a joint-space trajectory via a motion-planning module using the blueprints. (III) Robot Execution: The trajectory is executed to complete the task.
  • Figure 3: Visualization of object grasping results and the corresponding EACs. The red regions in the second column indicate the target part.
  • Figure 4: Hardware Configuration.
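
Step 1 of Policy Scaffolding in Figure 2 back-projects the segmented target object to a partial point cloud. The sketch below shows one common way to do this with a pinhole camera model. It assumes a depth image aligned with the RGB image and known intrinsics (`fx`, `fy`, `cx`, `cy`); neither the function name `backproject_mask` nor these inputs come from the paper, which does not specify its implementation here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],  # depth in metres, indexed [row v][column u]
    mask: Sequence[Sequence[int]],     # non-zero where the pixel belongs to the object
    fx: float, fy: float,              # focal lengths in pixels (assumed known)
    cx: float, cy: float,              # principal point in pixels (assumed known)
) -> List[Point]:
    """Lift masked pixels with valid depth into 3D camera-frame points."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0:  # skip background and missing depth readings
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Usage with a tiny 2x2 image (all values are illustrative).
depth = [[1.0, 2.0],
         [0.0, 4.0]]
mask = [[1, 0],
        [1, 1]]
cloud = backproject_mask(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The resulting cloud is partial because only surfaces visible to the camera produce depth readings, which is why the subsequent step estimates the blueprint's geometric parameters rather than relying on the raw points alone.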