URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, Abhishek Gupta

TL;DR

URDFormer tackles the need for scalable, realistic articulated simulation environments derived from real-world images to train robust robotic policies. It introduces a forward-inverse pipeline: in the forward direction, controllable diffusion generates paired image-scene data; in the inverse direction, a transformer-based model predicts URDFs from RGB inputs, decomposed into scene-level and object-level components. The approach is integrated into a real-to-sim-to-real pipeline with targeted randomization and is validated through RealityGym and zero-shot real-world transfers, with a reported overall success rate of 78%. It further demonstrates generalization to new object and scene types and supports multiple robots and tasks, offering a practical path toward large-scale, data-efficient robot learning in realistic simulations.

Abstract

Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand. A graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, to achieve the generalization properties that are required for data-driven robotic control, we require a pipeline that is able to synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used in generating paired training data that allows for modeling of the inverse problem, mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images and use these for training robotic control policies. We then robustly deploy in the real world for tasks like articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.

Paper Structure (40 sections, 25 figures, 2 tables)

Figures (25)

  • Figure 2: The URDFormer is trained on a large paired dataset of simulation assets and realistic renderings (forward). During inference, this process is inverted and it predicts the URDF from a real image (inverse).
  • Figure 3: Controlled Generation: Rendering URDF models in simulation and generating paired images with a guided diffusion model.
  • Figure 4: Depiction of the URDFormer Training Procedure and Architecture. (Left) Given an RGB image of the scene, e.g., a kitchen, we train two separate networks: URDFormer (Global) predicts each object's parent and the spatial information needed to place it, while URDFormer (Part) takes the cropped image containing each object and predicts its detailed structure. The two predictions are combined to create the full scene prediction. (Right) The URDFormer architecture takes as input a cropped RGB image and object part boxes and predicts a hierarchy consisting of a base class and parent-child relations that make up the final URDF file.
  • Figure 5: Qualitative Results for Real-world Robot Experiments: An RGB image is pre-processed by detecting bounding boxes of relevant parts. The URDFormer then predicts the corresponding URDF of the cabinet. When importing the cabinet into the simulation, it is re-scaled using depth measurements. Furthermore, the real-world texture is cropped using the bounding boxes and projected onto the cabinet. This realistic simulation can then be used to generate massive data with the help of motion planning, ground truth information, and targeted domain randomization. Finally, we show that training a language-conditioned multi-task policy can be zero-shot transferred to the real world to solve several opening and closing tasks.
  • Figure 6: Reality Gym: A simulation environment with a variety of assets originating from internet images (black box) using URDFormer. We predict URDFs of internet images, which can be loaded in any simulator. These URDFs are randomized with meshes from the PartNet dataset. We introduce 4 main tasks: (1) open any articulated parts, (2) close any articulated parts, (3) fetch objects, and (4) collect objects.
  • ...and 20 more figures
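
The Figure 4 caption describes the final output as a hierarchy of a base class and parent-child relations that make up a URDF file. As a rough illustration of that last step only, the sketch below turns a list of predicted parts into URDF XML using Python's standard library. It is not the authors' implementation: the `PartPrediction` fields, the `build_urdf` helper, the joint naming scheme, and the placeholder limit, effort, and velocity values are all assumptions made for this example, and links are emitted without geometry.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class PartPrediction:
    """Hypothetical container for one predicted part (not from the paper)."""
    name: str                      # link name, e.g. "door_left"
    parent: str                    # name of the parent link
    joint_type: str                # "revolute", "prismatic", or "fixed"
    origin_xyz: tuple              # placement relative to the parent link
    axis_xyz: tuple = (0.0, 0.0, 1.0)
    lower: float = 0.0             # placeholder joint limits
    upper: float = 1.0


def _fmt(values):
    """Format a tuple of numbers as a space-separated URDF attribute."""
    return " ".join(f"{v:g}" for v in values)


def build_urdf(robot_name, base_link, parts):
    """Assemble URDF XML from a base link and parent-child part predictions."""
    robot = ET.Element("robot", name=robot_name)
    ET.SubElement(robot, "link", name=base_link)
    for part in parts:
        ET.SubElement(robot, "link", name=part.name)
        joint = ET.SubElement(
            robot, "joint",
            name=f"{part.parent}_to_{part.name}",
            type=part.joint_type,
        )
        ET.SubElement(joint, "parent", link=part.parent)
        ET.SubElement(joint, "child", link=part.name)
        ET.SubElement(joint, "origin", xyz=_fmt(part.origin_xyz), rpy="0 0 0")
        if part.joint_type in ("revolute", "prismatic"):
            # URDF requires an axis and a limit element for movable joints;
            # the effort and velocity numbers here are arbitrary placeholders.
            ET.SubElement(joint, "axis", xyz=_fmt(part.axis_xyz))
            ET.SubElement(
                joint, "limit",
                lower=f"{part.lower:g}", upper=f"{part.upper:g}",
                effort="10", velocity="1",
            )
    return ET.tostring(robot, encoding="unicode")


# Example: a cabinet with one hinged door and one sliding drawer.
example_parts = [
    PartPrediction("door_left", "base", "revolute", (0.2, 0.0, 0.5), upper=1.57),
    PartPrediction("drawer", "base", "prismatic", (0.0, 0.0, 0.1),
                   axis_xyz=(1.0, 0.0, 0.0), upper=0.3),
]
example_urdf = build_urdf("cabinet", "base", example_parts)
print(example_urdf)
```

A simulator would still need visual and collision geometry on each link, which the captions suggest comes from predicted part boxes, depth-based rescaling, and projected textures; those steps are left out here.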