Building Explicit World Model for Zero-Shot Open-World Object Manipulation

Xiaotong Li, Gang Chen, Javier Alonso-Mora

Abstract

Open-world object manipulation remains a fundamental challenge in robotics. While Vision-Language-Action (VLA) models have demonstrated promising results, they rely heavily on large-scale robot action demonstrations, which are costly to collect and can hinder out-of-distribution generalization. In this paper, we propose an explicit-world-model-based framework for open-world manipulation that achieves zero-shot generalization by constructing a physically grounded digital twin of the environment. The framework integrates open-set perception, digital-twin reconstruction, and the sampling and evaluation of interaction strategies. Using this digital twin, our approach efficiently explores and evaluates manipulation strategies in a physics-enabled simulator and reliably deploys the chosen strategy in the real world. Experimentally, the proposed framework performs multiple open-set manipulation tasks without any task-specific action demonstrations, demonstrating strong zero-shot generalization at both the task and object levels. Project Page: https://bojack-bj.github.io/projects/thesis/

Paper Structure (15 sections, 6 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of our manipulation task framework with explicit world model construction. The system takes an RGB-D observation $I$ and a natural language task instruction $C$ as input and proceeds through four main stages: (A) Open-set segmentation, (B) Open-set object grasping, (C) Digital twin reconstruction, (D) Manipulation strategy sampling. The outcome is the most promising manipulation strategy, which is then executed on the real robot.
  • Figure 2: The proposed digital twin construction module, which comprises mesh generation and two-stage pose alignment. We first generate a textured mesh from the masked RGB image via Hunyuan3D 2.0 (Zhao et al., 2025). During coarse alignment, we render RGB and depth images from a set of hypothesis poses, compare each rendering with the real-world observation in DINO (Oquab et al., 2024) feature space, and select the pose whose rendering best matches the observation. The resulting coarse pose is then refined using RANSAC and ICP on the partial point cloud back-projected from the depth image.
  • Figure 3: Our manipulation strategy sampling module. (a) Interaction area segmentation: We use GPT-4o and Grounded-SAM to segment the interaction area, which constrains the sample space for translations. (b) Interaction strategy sampling: Different rotations of the dynamic object are sampled, and the outcomes are simulated in Isaac Sim. (c) Result checking: The results are rendered in the simulator. We query GPT-4o again to check which samples fulfill the task requirements, and feed the results to a Gaussian Process classifier to obtain the success probabilities.
  • Figure 4: Qualitative results of the proposed two-stage mesh alignment method. The left image shows the aligned meshes overlaid with the RGB-D observation. The right image illustrates the corresponding real-world setup.
  • Figure 5: Representative tasks used for experimental validation. Frames with the same color represent the same task, where the top one denotes the initial scene and the bottom one denotes the final scene.
  • ...and 2 more figures
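
The coarse alignment step described in the Figure 2 caption (score each hypothesis pose by how well its rendering matches the observation in feature space, then keep the best one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `select_coarse_pose` are hypothetical, cosine similarity is an assumed choice of metric (the caption only says the comparison happens in DINO feature space), and the feature vectors are toy values standing in for real DINO embeddings of rendered and observed images.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def select_coarse_pose(
    observed_feature: Sequence[float],
    rendered_features: List[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index, score) of the hypothesis pose whose rendered
    feature is most similar to the observed feature.

    Assumption: each hypothesis pose has already been rendered and
    embedded into a single feature vector by some image encoder.
    """
    if not rendered_features:
        raise ValueError("need at least one hypothesis pose")
    scores = [cosine_similarity(observed_feature, f) for f in rendered_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy stand-ins for image embeddings (hypothetical values).
observed = [1.0, 0.0, 1.0]
hypotheses = [
    [0.0, 1.0, 0.0],    # orthogonal to the observation
    [0.9, 0.1, 1.1],    # close to the observation
    [-1.0, 0.0, -1.0],  # opposite direction
]
best_index, best_score = select_coarse_pose(observed, hypotheses)
print(best_index, best_score)
```

In this toy example the second hypothesis (index 1) is selected. In the pipeline described by the caption, the selected pose is only a coarse estimate that is subsequently refined with RANSAC and ICP on the back-projected partial point cloud.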