GRS: Generating Robotic Simulation Tasks from Real-World Images
Alex Zook, Fan-Yun Sun, Josef Spjut, Valts Blukis, Stan Birchfield, Jonathan Tremblay
TL;DR
GRS tackles real-to-sim translation for robotics by deriving digital twin simulations from a single RGB-D image. It couples scene understanding via SAM2 and VLM-based object descriptions with asset matching and task generation to form solvable robotic objectives. A novel LLM-based router iteratively refines both the simulation program and its test suite to ensure alignment with the intended task. Empirical results show robust object correspondence, effective task generation, and scalability to large asset libraries, highlighting the method's potential for automated robotics training, game development, and educational simulations.
Abstract
We introduce GRS (Generating Robotic Simulation tasks), a system that addresses the real-to-sim problem for robotic simulation. GRS creates digital twin simulations from a single RGB-D observation, together with solvable tasks for training virtual agents. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension, using SAM2 for segmentation and a VLM for object description, 2) matching the detected objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure that the simulation and the task stay aligned by generating test suites, and we introduce a router that iteratively refines both the simulation code and the test code. Experiments demonstrate the system's effectiveness at object correspondence and, through the novel router mechanism, at generating task environments.
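The router's refinement loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `refine`, `run_tests`, `route`, `fix_sim`, and `fix_tests` are hypothetical, and the callables stand in for the LLM and simulator calls that the abstract does not specify. The only behavior taken from the text is that a router decides, on each iteration, whether to revise the simulation code or the test code.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Candidate:
    """A simulation program paired with its generated test suite (hypothetical type)."""
    sim_code: str
    test_code: str


def refine(
    candidate: Candidate,
    run_tests: Callable[[Candidate], Tuple[bool, str]],  # returns (passed, log)
    route: Callable[[Candidate, str], str],              # returns "sim" or "tests"
    fix_sim: Callable[[Candidate, str], str],            # returns revised simulation code
    fix_tests: Callable[[Candidate, str], str],          # returns revised test code
    max_iters: int = 5,
) -> Tuple[Candidate, bool]:
    """Repair either the simulation or the tests until the tests pass.

    All callables are assumptions: in a real system they would wrap a
    simulator run and LLM prompts.
    """
    for _ in range(max_iters):
        passed, log = run_tests(candidate)
        if passed:
            return candidate, True
        # The router picks which artifact to revise based on the failure log.
        target = route(candidate, log)
        if target == "sim":
            candidate = Candidate(fix_sim(candidate, log), candidate.test_code)
        elif target == "tests":
            candidate = Candidate(candidate.sim_code, fix_tests(candidate, log))
        else:
            raise ValueError(f"unknown routing target: {target!r}")
    # Budget exhausted: report whether the last revision happens to pass.
    passed, _ = run_tests(candidate)
    return candidate, passed
```

Bounding the loop with `max_iters` is a design choice of this sketch, so that a task whose simulation and tests never agree is reported as unaligned rather than retried indefinitely.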
