DexSim2Real$^{2}$: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation

Taoran Jiang, Yixuan Guan, Liqian Ma, Jing Xu, Jiaojiao Meng, Weihang Chen, Zecui Zeng, Lusong Li, Dan Wu, Rui Chen

TL;DR

DexSim2Real$^{2}$ presents an explicit world-model framework for unseen articulated objects, built via interactive perception and 3D AIGC-based geometry, that enables long-horizon manipulation with sampling-based MPC. It supports suction grippers, two-finger grippers, and dexterous hands, uses eigengrasp to manage the high-dimensional action space of dexterous hands, and leverages VRB and Where2Act for affordance learning from both simulation and human videos. The explicit world model, loaded into a physics simulator as a URDF, enables precise manipulation and tool use on unseen objects with less data than policy-learning approaches require. Across multiple objects and end-effectors, the approach demonstrates accurate real-world manipulation and scalable handling of multi-part articulations.

Abstract

Articulated objects are ubiquitous in daily life. In this paper, we present DexSim2Real$^{2}$, a novel framework for goal-conditioned articulated object manipulation. The core of our framework is constructing an explicit world model of unseen articulated objects through active interactions, which enables sampling-based model predictive control to plan trajectories achieving different goals without requiring demonstrations or RL. The framework first predicts an interaction using an affordance network trained on self-supervised interaction data or videos of human manipulation. After the interactions are executed on the real robot to move the object parts, a novel modeling pipeline based on 3D AIGC builds a digital twin of the object in simulation from multiple frames of observations. For dexterous hands, we utilize eigengrasp to reduce the action dimension, enabling more efficient trajectory search. Experiments validate the framework's effectiveness for precise manipulation using a suction gripper, a two-finger gripper, and two dexterous hands. The generalizability of the explicit world model also enables advanced manipulation strategies such as manipulating with tools.
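The sampling-based model predictive control mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which sampler is used, so the cross-entropy method (CEM) below is an assumption, and the 1-DoF joint model in `simulate` is a hypothetical stand-in for the physics simulator; all names and parameters (`cem_plan`, `gain`, `horizon`, and so on) are illustrative only.

```python
import random
import statistics


def simulate(theta0, actions, gain=0.8, lower=0.0, upper=1.57):
    """Toy 1-DoF joint model (hypothetical stand-in for a physics simulator).

    Each action nudges the joint angle; the angle is clamped to joint limits.
    Returns the final joint angle after applying all actions.
    """
    theta = theta0
    for u in actions:
        theta = min(upper, max(lower, theta + gain * u))
    return theta


def cost(theta0, actions, goal):
    """Squared distance to the goal angle plus a small effort penalty."""
    final = simulate(theta0, actions)
    effort = sum(u * u for u in actions)
    return (final - goal) ** 2 + 1e-3 * effort


def cem_plan(theta0, goal, horizon=10, samples=200, elites=20, iters=15, seed=0):
    """Plan an action sequence with the cross-entropy method (assumed sampler)."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [0.2] * horizon
    best = list(mean)
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(mean[t], std[t]) for t in range(horizon)]
            for _ in range(samples)
        ]
        # Rank candidates by simulated cost and keep the lowest-cost elites.
        candidates.sort(key=lambda a: cost(theta0, a, goal))
        elite_set = candidates[:elites]
        best = elite_set[0]
        # Refit the sampling distribution to the elites.
        mean = [statistics.fmean(a[t] for a in elite_set) for t in range(horizon)]
        std = [
            statistics.pstdev([a[t] for a in elite_set]) + 1e-3
            for t in range(horizon)
        ]
    # Return the lowest-cost sequence from the final iteration.
    return best


if __name__ == "__main__":
    plan = cem_plan(theta0=0.0, goal=1.0)
    print("final joint angle:", simulate(0.0, plan))
```

In a receding-horizon setting, only the first few actions of the returned plan would be executed before replanning from the newly observed state; the sketch plans once for brevity.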

Paper Structure (46 sections, 16 equations, 22 figures, 5 tables, 1 algorithm)

Figures (22)

  • Figure 1: DexSim2Real$^{2}$ is a robot learning framework for precise goal-conditioned articulated object manipulation with suction grippers, two-finger grippers, and multi-finger dexterous hands in the real world. It builds the mental model of the unseen target object through active interactions and uses the model to generate a long-horizon manipulation trajectory.
  • Figure 2: Overview of the DexSim2Real$^{2}$ framework. Our framework consists of three phases. (1) Given a partial point cloud of an unseen articulated object, in the Interactive Perception phase, we train an affordance prediction module and use it to change the object’s joint state through a one-step interaction. Training data can be acquired through self-supervised interaction in simulation or from egocentric human demonstration videos. (2) In the Explicit Physics Model Construction phase, we build a mental model in a physics simulator from the $K+1$ frames of observations. (3) In the Sampling-based Model Predictive Control phase, we use the model to plan a long-horizon trajectory in simulation and then execute the trajectory on the real robot to complete the task. For dexterous hands, an eigengrasp module is needed for dimensionality reduction.
  • Figure 3: (a) Framework for generating a real-robot manipulation trajectory from 2D affordances. (b) Method for computing the 3D post-contact vector.
  • Figure 4: Our pipeline for explicit world model construction: For each state of the articulated object, we begin by generating an unaligned and unscaled mesh from multi-view RGB images using 3D AIGC. Next, we estimate the scale and pose through differentiable rendering, and segment the aligned mesh into sub-parts. Once segmented point clouds for each state are obtained, we infer movable part segmentation by analyzing differences between frames of point clouds. We estimate the kinematic structure of the mesh, including the part tree hierarchy, joint categories (prismatic or revolute), and joint configurations (axis direction and origin). Finally, we construct a digital twin of the articulated object represented in URDF format, which can be easily loaded into different physics simulators.
  • Figure 5: Real-world experimental setup with (a) a suction gripper; (b) a two-finger gripper; and (c), (d) dexterous hands. An additional RealSense D415 serves as the side camera.
  • ...and 17 more figures
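
The last step in the Figure 4 caption, writing the estimated kinematic structure out as a URDF, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function name `build_urdf`, the link and joint names, the mesh file names, and the fixed effort and velocity limits are all hypothetical, and it covers a single movable part attached to a base.

```python
import xml.etree.ElementTree as ET


def _fmt(vec):
    """Format a 3-vector as the space-separated string URDF expects."""
    return " ".join(f"{v:g}" for v in vec)


def build_urdf(name, base_mesh, part_mesh, joint_type, axis, origin, lower, upper):
    """Return a URDF string for a base link and one movable part.

    joint_type is "revolute" or "prismatic"; axis and origin are the
    estimated joint axis direction and joint origin in the base frame.
    """
    if joint_type not in ("revolute", "prismatic"):
        raise ValueError(f"unsupported joint type: {joint_type}")

    robot = ET.Element("robot", name=name)

    # One link per segmented part, using the same mesh for visual and collision.
    for link_name, mesh in (("base", base_mesh), ("part_0", part_mesh)):
        link = ET.SubElement(robot, "link", name=link_name)
        for tag in ("visual", "collision"):
            geometry = ET.SubElement(ET.SubElement(link, tag), "geometry")
            ET.SubElement(geometry, "mesh", filename=mesh)

    # The joint connects the movable part to the base.
    joint = ET.SubElement(robot, "joint", name="joint_0", type=joint_type)
    ET.SubElement(joint, "parent", link="base")
    ET.SubElement(joint, "child", link="part_0")
    ET.SubElement(joint, "origin", xyz=_fmt(origin), rpy="0 0 0")
    ET.SubElement(joint, "axis", xyz=_fmt(axis))
    # Effort and velocity limits are placeholder values (assumption).
    ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                  effort="10", velocity="1")

    return ET.tostring(robot, encoding="unicode")


if __name__ == "__main__":
    print(build_urdf("cabinet", "base.obj", "door.obj", "revolute",
                     axis=(0, 0, 1), origin=(0.3, 0.2, 0.0),
                     lower=0.0, upper=1.57))
```

The sketch omits `<inertial>` elements, which some physics simulators need for dynamic links, and a multi-part object would need one link and one joint per node of the part tree.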