GRS: Generating Robotic Simulation Tasks from Real-World Images

Alex Zook, Fan-Yun Sun, Josef Spjut, Valts Blukis, Stan Birchfield, Jonathan Tremblay

TL;DR

GRS tackles real-to-sim translation for robotics by deriving digital twin simulations from a single RGB-D image. It couples scene understanding via SAM2 and VLM-based object descriptions with asset matching and task generation to form solvable robotic objectives. A novel LLM-based router iteratively refines both the simulation program and its test suite to ensure alignment with the intended task. Empirical results show robust object correspondence, effective task generation, and scalability to large asset libraries, highlighting the method's potential for automated robotics training, game development, and educational simulations.

Abstract

We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.
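The router loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Programs`, `refine`, `run_tests`, `toy_route`, and `toy_repair` are hypothetical, and the routing and repair steps, which GRS delegates to an LLM, are replaced here by hard-coded stand-ins so the example runs on its own. The loop runs the generated tests against the generated simulation; on failure, a router chooses which of the two programs to repair, and the process repeats until the tests pass or an iteration budget is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Programs:
    simulation: str  # generated simulation code (hypothetical representation)
    tests: str       # generated test code that checks the simulation


def run_tests(p: Programs) -> Optional[str]:
    """Execute the simulation, then its tests; return an error string or None."""
    scope: dict = {}
    try:
        exec(p.simulation, scope)
        exec(p.tests, scope)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def refine(
    programs: Programs,
    route: Callable[[Programs, str], str],
    repair: Callable[[str, str], str],
    max_iters: int = 5,
) -> Tuple[Programs, bool]:
    """Repair the simulation or the tests until the tests pass."""
    for _ in range(max_iters):
        error = run_tests(programs)
        if error is None:
            return programs, True
        # The router decides which program is at fault for this error.
        if route(programs, error) == "simulation":
            programs = Programs(repair(programs.simulation, error), programs.tests)
        else:
            programs = Programs(programs.simulation, repair(programs.tests, error))
    return programs, run_tests(programs) is None


# Assumption: toy stand-ins for what would be LLM calls in the real system.
def toy_route(p: Programs, error: str) -> str:
    # A NameError here means the test refers to a variable the simulation lacks.
    return "tests" if error.startswith("NameError") else "simulation"


def toy_repair(code: str, error: str) -> str:
    if error.startswith("NameError"):
        return code.replace("objects", "placed")
    return code.replace("['cube']", "['cube', 'bowl']")


start = Programs(
    simulation="placed = ['cube']",
    tests="assert 'bowl' in objects, 'bowl missing'",
)
final, ok = refine(start, toy_route, toy_repair)
print(ok, final.simulation)
```

In this toy run, the first failure is attributed to the tests (a wrong variable name) and the second to the simulation (a missing object), which mirrors the idea that either program may be the source of a mismatch.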

Paper Structure

This paper contains 15 sections, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: GRS solves the problem of generating robotics simulations with solvable tasks from real-world images. During task generation, GRS can use a subset of objects and change the object orientation and positioning to provide interesting variations on the initial scene.
  • Figure 2: GRS workflow for real-to-sim conversion. The process has four stages: 1) scene description generation using segmented images and simulation-ready assets; 2) task creation based on the scene; 3) initial simulation and test code generation; 4) iterative refinement until error-free simulation is achieved. Color and shape coding is used for certain inputs to enhance visual clarity.
  • Figure 3: Qualitative results of five scenarios (rows) depicting task execution. Each row presents the input scene image (left), initial state (center), and final state after an oracle task execution (right). Task descriptions are provided beneath the central and rightmost columns. Note that task generation may omit some observed assets and may not preserve the object orientations from the input image. Following GenSim, we include the use of basic shapes like colored cubes and containers.
  • Figure 4: Real-world image (left) and a simulation environment in 3D (right) obtained with background reconstruction plus GRS's scene comprehension pipeline, using Objaverse (Deitke et al., 2023) as the asset repository.