ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

TL;DR

ArtVIP introduces a high-quality, open-source dataset of 206 articulated digital-twin assets across 26 categories, with realistic geometry, textures, and physically calibrated joints, plus scene assets and pixel-level affordances. It embeds modular interaction behaviors directly into assets and provides precise collision and joint dynamics to reduce the sim-to-real gap. The authors validate ArtVIP through objective visual realism and physical fidelity evaluations and demonstrate effectiveness in imitation learning on real robots and reinforcement learning in high-fidelity simulators. The dataset, delivered in USD format with production guidelines, aims to accelerate diverse robotic manipulation research by offering ready-to-use, reusable assets for sim-to-real transfer.

Abstract

Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. In addition, the dataset pioneers modular interaction behaviors embedded within assets, as well as pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, and its applicability is validated in imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/.

Paper Structure

This paper contains 24 sections, 7 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of ArtVIP.
  • Figure 2: An asset example in ArtVIP. Left: Top-down assembly principle. Middle: Assembly process. Right: Comparison of the real object (a) with its digital twin (b), and annotations (c).
  • Figure 3: Left: Comparison of triangle counts. Right: Rendering comparison.
  • Figure 4: Left: Reconstruction of a microwave. OmniGibson yields poor results due to coarse geometry, while ArtVIP enables better reconstruction via more realistic details. Right: CLIP-based (Radford et al., 2021) feature distribution. Each color denotes a data source, and ArtVIP features align more closely with real-world data.
  • Figure 5: Left: Digital-twin asset examples in real-world and simulation. Right: Analysis of the drawer's displacement driven by different forces.
  • ...and 7 more figures