ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo

Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready, semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

Paper Structure

This paper contains 30 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: ManiTwin. An automated pipeline for synthesizing data-generation-ready assets and rich annotations, and ManiTwin-100K, a large-scale dataset of 100K such assets with diverse, high-quality annotations.
  • Figure 2: ManiTwin Pipeline Overview. Our pipeline consists of three stages: (I) Asset Generation transforms input images into simulation-ready 3D meshes with VLM-estimated physical properties; (II) Asset Annotation combines FPS-based candidate sampling, VLM-driven functional and grasp point selection, and learning-based grasp proposal generation; (III) Verification validates annotations through physics simulation and human review, producing fully annotated digital twins ready for robotic manipulation research.
  • Figure 3: Annotation Visualization. ManiTwin provides functional points (e.g., spout, handle), grasp points with type labels, and simulation-verified 6-DoF grasp poses for each object.
  • Figure 4: ManiTwin Data Generation. (Left) Cross-embodiment manipulation trajectories across multiple end-effectors using shared object annotations. (Right) Grasping data generation.
  • Figure 5: Layout Generation. Using placement and collision radius annotations, we generate diverse multi-object scene layouts that are collision-free and physically plausible.
  • ...and 4 more figures
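The Figure 5 caption says that scene layouts are generated from placement and collision radius annotations so that objects do not collide. The paper's actual procedure is not described in this section, so the following is only a minimal sketch of one common way to do this: rejection sampling of disc centres on a flat tabletop. The function name `sample_layout`, the rectangular table, and the treatment of each object as a 2D disc are all assumptions made for illustration; the sketch checks overlap only and does not run any physics.

```python
import math
import random


def sample_layout(radii, table_size=(1.0, 1.0), max_tries=1000, seed=0):
    """Place one disc per object on a rectangular table by rejection sampling.

    radii      -- per-object collision radius in metres (hypothetical stand-in
                  for a per-asset collision radius annotation)
    table_size -- (width, depth) of the table in metres (assumed rectangular)
    Returns a list of (x, y) centres, one per object, in input order.
    """
    rng = random.Random(seed)
    width, depth = table_size
    placed = []  # (x, y, r) for every object already accepted

    for r in radii:
        if 2 * r > width or 2 * r > depth:
            raise ValueError(f"object with radius {r} does not fit on the table")
        for _ in range(max_tries):
            # Keep the whole disc inside the table bounds.
            x = rng.uniform(r, width - r)
            y = rng.uniform(r, depth - r)
            # Accept only if the disc clears every previously placed disc.
            if all(math.hypot(x - px, y - py) >= r + pr for px, py, pr in placed):
                placed.append((x, y, r))
                break
        else:
            raise RuntimeError("no collision-free position found; table too crowded")

    return [(x, y) for x, y, _ in placed]


radii = [0.05, 0.08, 0.10, 0.06]
layout = sample_layout(radii)
```

Changing `seed` yields a different layout for the same objects, which is one simple way to obtain scene diversity. A fuller pipeline would presumably also sample orientations and verify stability in simulation, which this sketch leaves out.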