
ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation

Chenyu Su, Weiwei Shang, Chen Qian, Fei Zhang, Shuang Cong

TL;DR

ReSem3D tackles the challenge of converting high-level natural-language instructions into fine-grained, executable 3D spatial constraints for robotic manipulation in semantically diverse, unstructured environments. It combines Vision Foundation Models and Multimodal LLMs to construct a two-stage hierarchical 3D spatial constraint model and encodes the resulting constraints as real-time joint-space optimization objectives solved by MPPI within an MLLM-driven TAMP framework. The paper demonstrates strong zero-shot generalization and robustness across household and chemical-lab scenarios, achieving reactive closed-loop control on multiple robotic platforms. This work advances semantically grounded manipulation with real-time feedback, enabling more flexible and reliable autonomous manipulation in open-world settings.

Abstract

Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, and (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments that leverages the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior under dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.
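
To make the idea of encoding a 3D spatial constraint as a joint-space optimization objective more concrete, the following is a minimal sketch of a sampling-based MPPI loop. It is not the authors' implementation: the planar two-link arm, the link lengths, the squared-distance cost, and all names (`forward_kinematics`, `constraint_cost`, `mppi_update`, `run_closed_loop`) and hyperparameters are assumptions chosen only for illustration.

```python
import math
import random

L1, L2 = 0.5, 0.5  # hypothetical link lengths in metres

def forward_kinematics(q):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y

def constraint_cost(q, target):
    """Squared distance between the end effector and the constraint point."""
    x, y = forward_kinematics(q)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2

def mppi_update(q, nominal, target, rng, samples=128, sigma=1.0, lam=0.05, dt=0.05):
    """One MPPI iteration: perturb the nominal velocities, roll out, reweight."""
    horizon = len(nominal)
    noises, costs = [], []
    for _ in range(samples):
        eps = [[rng.gauss(0.0, sigma) for _ in range(2)] for _ in range(horizon)]
        qk, total = list(q), 0.0
        for t in range(horizon):
            for j in range(2):
                qk[j] += (nominal[t][j] + eps[t][j]) * dt  # integrate joint velocity
            total += constraint_cost(qk, target)
        noises.append(eps)
        costs.append(total)
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    norm = sum(weights)
    return [
        [nominal[t][j] + sum(w * e[t][j] for w, e in zip(weights, noises)) / norm
         for j in range(2)]
        for t in range(horizon)
    ]

def run_closed_loop(q0, target, steps=80, horizon=10, dt=0.05, seed=0):
    """Re-plan at every step and apply only the first velocity command."""
    rng = random.Random(seed)
    q = list(q0)
    nominal = [[0.0, 0.0] for _ in range(horizon)]
    for _ in range(steps):
        nominal = mppi_update(q, nominal, target, rng, dt=dt)
        command = nominal[0]
        q = [q[j] + command[j] * dt for j in range(2)]
        nominal = nominal[1:] + [[0.0, 0.0]]  # shift the plan forward by one step
    return q
```

Because the plan is recomputed from the current joint state at every step, a change in `target` between iterations (for example, a tracked point that moves) would be absorbed by the next update, which is the sense in which such a loop is reactive.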

Paper Structure

This paper contains 18 sections, 35 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overall Framework. Given a natural language instruction and RGB-D observations, the VFM segments semantically relevant part-level regions and overlays visual prompts to facilitate initial constraint generation. Under the MLLM-driven TAMP framework, constraint modeling is conducted hierarchically in two stages: part-level extraction and region-level refinement. The resulting 3D spatial constraints are encoded as a cost function for real-time parsing and solved in closed loop by an MPPI-based optimizer in Isaac Gym, enabling joint-space velocity control with point tracking.
  • Figure 2: Part-Level Constraint Extraction. The original mask is generated via Fast Segment Anything (FastSAM), followed by mask filtering, part-level semantic consistency clustering, and centroid annotation to extract part-level spatial constraints within the MLLM-driven TAMP framework.
  • Figure 3: Region-Level Constraint Refinement. Within the MLLM-driven TAMP framework, this module encompasses two strategies: geometric and positional constraint refinement. The geometric refinement adjusts the tweezers’ grasp point from the center to the two tips, thereby introducing detailed geometric priors. The positional refinement localizes the trash bin’s placement to the midpoint between symmetrical centers, thereby enhancing the spatial precision.
  • Figure 4: MLLM-driven automated modeling and communication framework for TAMP.
  • Figure 5: ReSem3D is a unified robotic manipulation framework for semantically diverse environments. It leverages the synergy between MLLMs and VFMs to construct semantics-driven, two-stage hierarchical 3D spatial constraints, which are mapped into real-time optimization objectives in joint space to enable closed-loop perception-action control.
  • ...and 4 more figures
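
The centroid annotation mentioned in Figure 2 and the midpoint-based positional refinement mentioned in Figure 3 can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's pipeline: it assumes a binary mask stored as a list of rows, a pinhole camera model with known intrinsics (`fx`, `fy`, `cx`, `cy`), and a metric depth value at the centroid; the function names are hypothetical.

```python
def mask_centroid(mask):
    """Pixel centroid (u, v) of a binary mask given as a list of rows."""
    pixels = [(u, v) for v, row in enumerate(mask) for u, val in enumerate(row) if val]
    if not pixels:
        raise ValueError("mask is empty")
    n = len(pixels)
    return sum(u for u, _ in pixels) / n, sum(v for _, v in pixels) / n

def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at metric depth into the camera frame (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def midpoint(p, q):
    """Midpoint between two 3D points, e.g. two symmetrical part centers."""
    return tuple((a + b) / 2 for a, b in zip(p, q))

# Usage with toy values (all numbers are made up for illustration).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
u, v = mask_centroid(mask)
point = back_project(u, v, depth=2.0, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
refined = midpoint((0.0, 0.0, 0.0), (2.0, 4.0, 6.0))
```

In a real system the depth would come from the RGB-D frame at the annotated pixel and the intrinsics from camera calibration; both are hard-coded here only to keep the example self-contained.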