ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Zhao Jin; Zhengping Che; Zhen Zhao; Kun Wu; Yuheng Zhang; Yinuo Zhao; Zehui Liu; Qiang Zhang; Xiaozhu Ju; Jing Tian; Yousong Xue; Jian Tang

TL;DR

ArtVIP introduces a high-quality, open-source dataset of 206 articulated digital-twin assets across 26 categories, with realistic geometry, textures, and physically calibrated joints, plus scene assets and pixel-level affordances. It embeds modular interaction behaviors directly into the assets and provides precise collision and joint dynamics to reduce the sim-to-real gap. The authors validate ArtVIP through objective evaluations of visual realism and physical fidelity, and demonstrate its effectiveness for imitation learning on real robots and for reinforcement learning in high-fidelity simulators. The dataset, delivered in USD format with production guidelines, aims to accelerate diverse robotic manipulation research by offering ready-to-use, reusable assets for sim-to-real transfer.

Abstract

Robot learning increasingly relies on simulation to develop complex abilities such as dexterous manipulation and precise interaction, which requires high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation suffer from insufficient visual realism and low physical fidelity, which limits their usefulness for training models that must master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset of high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers following unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved through fine-tuned dynamic parameters. In addition, the dataset is the first to embed modular interaction behaviors directly within assets and to provide pixel-level affordance annotations. Feature-map visualization and optical motion capture are used to quantitatively demonstrate ArtVIP's visual and physical fidelity, and its applicability is validated through imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/ .
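The abstract states that physical fidelity comes from fine-tuned dynamic parameters and is checked with optical motion capture, but it does not describe a procedure. The following is a hypothetical illustration of the general idea, not ArtVIP's pipeline: it models a hinge (for example, a self-closing cabinet door) as a torsional spring-damper, simulates its angle trajectory, and picks the damping value whose trajectory has the lowest RMSE against a reference recording. The function names `simulate_hinge`, `rmse`, and `fit_damping`, the spring-damper model, and the grid search are all assumptions made for this sketch; a synthetic trajectory stands in for real motion-capture data.

```python
# Hypothetical illustration only; this is NOT the ArtVIP calibration procedure.
import math
from typing import List


def simulate_hinge(stiffness: float, damping: float, inertia: float = 1.0,
                   theta0: float = math.radians(90.0), dt: float = 0.01,
                   steps: int = 200) -> List[float]:
    """Angle trajectory (rad) of a hinge pulled back toward 0 by a torsional spring-damper.

    Assumed model: inertia * theta'' = -stiffness * theta - damping * theta',
    integrated with semi-implicit (symplectic) Euler, starting at rest at theta0.
    """
    theta, omega = theta0, 0.0
    traj = []
    for _ in range(steps):
        alpha = (-stiffness * theta - damping * omega) / inertia  # angular acceleration
        omega += alpha * dt   # update velocity first (semi-implicit Euler)
        theta += omega * dt   # then position, using the new velocity
        traj.append(theta)
    return traj


def rmse(a: List[float], b: List[float]) -> float:
    """Root-mean-square error between two equally long, time-aligned angle sequences."""
    assert len(a) == len(b) and a, "trajectories must be non-empty and aligned"
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def fit_damping(measured: List[float], stiffness: float,
                candidates: List[float]) -> float:
    """Return the candidate damping whose simulated trajectory best matches `measured`."""
    return min(
        candidates,
        key=lambda c: rmse(simulate_hinge(stiffness, c, steps=len(measured)), measured),
    )


if __name__ == "__main__":
    # Stand-in for a motion-capture recording: synthetic data generated with damping 3.0.
    measured = simulate_hinge(stiffness=10.0, damping=3.0)
    grid = [0.5 * i for i in range(1, 13)]  # candidate damping values 0.5 .. 6.0
    best = fit_damping(measured, stiffness=10.0, candidates=grid)
    print(f"best damping: {best:.1f}")  # prints "best damping: 3.0" on this noise-free data
```

On noise-free synthetic data the grid search recovers the damping used to generate the reference. Real recordings would add sensor noise, time alignment, and further parameters (stiffness, friction, joint limits), so a practical fit would more likely use a continuous optimizer than a fixed grid.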

Figures (12)