AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation

Yulu Wu, Jiujun Cheng, Haowen Wang, Dengyang Suo, Pei Ren, Qichao Mao, Shangce Gao, Yakun Huang

TL;DR

AOMGen tackles the data bottleneck in articulated object manipulation by generating photoreal, physically consistent demonstrations from a single real scan and demonstration. It combines scene reconstruction based on 3D Gaussian Splatting with a motion-recovery pipeline to capture accurate interactions, then replaces the articulated object with other instances of the same category and generalizes their poses. The resulting data significantly improves Vision-Language-Action policy fine-tuning and robustness to unseen objects and configurations. The framework reduces reliance on extensive real-world data collection and on imperfect simulators, enabling scalable, realistic training data for complex articulated manipulation tasks.

Abstract

Recent advances in Vision-Language-Action (VLA) and world-model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine-grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data generation framework for articulated manipulation that is instantiated from a single real scan, a demonstration, and a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes synchronized multi-view RGB temporally aligned with action commands and state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results demonstrate that fine-tuning VLA policies on AOMGen data increases the success rate from 0% to 88.7%, with the policies tested on unseen objects and layouts.

Paper Structure

This paper contains 25 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: As a powerful articulated object manipulation data generator, the proposed AOMGen generates visually realistic and interaction-accurate data for any object of the same category within a unified framework. The generated data also effectively improves the model's performance.
  • Figure 2: Pipeline of the proposed AOMGen. An object with a rotational joint is used as an example to illustrate the complete pipeline; a prismatic object can be handled in the same manner.
  • Figure 3: Computation of the Motion Score.
  • Figure 4: The Gaussian field visualizations of the data generated from AOMGen.
  • Figure 5: Replay of generated data in the simulator.
  • ...and 2 more figures