Proc4Gem: Foundation models for physical agency through procedural generation

Yixin Lin, Jan Humplik, Sandy H. Huang, Leonard Hasenclever, Francesco Romano, Stefano Saliceti, Daniel Zheng, Jose Enrique Chen, Catarina Barros, Adrian Collister, Matt Young, Adil Dostmohamed, Ben Moran, Ken Caluwaerts, Marissa Giustina, Joss Moore, Kieran Connell, Francesco Nori, Nicolas Heess, Steven Bohez, Arunkumar Byravan

TL;DR

This work tackles the challenge of combining semantic understanding with physical contact dynamics in robot learning by pairing photorealistic, procedurally generated simulations with large multimodal foundation models. It introduces Proc4Gem, a pipeline that fine-tunes Gemini on simulation data to obtain a language-conditioned, whole-body control policy, then deploys that policy on a real quadruped to push objects toward unseen targets in new environments. The results show strong sim2real transfer: the fine-tuned Gemini policy outperforms a strong baseline on hardware, especially under hard or out-of-distribution conditions, and is robust to language variation and scene diversity. These findings highlight the potential of realistic simulation as a scalable source of task-focused data for grounding foundation models in physical agency, with implications for end-to-end embodied AI in real-world robotics. Future directions include leveraging larger context windows, enhancing visual generation, and moving beyond behavior cloning toward reinforcement learning within simulation.

Abstract

In robot learning, it is common to either ignore the environment semantics, focusing on tasks like whole-body control which only require reasoning about robot-environment contacts, or conversely to ignore contact dynamics, focusing on grounding high-level movement in vision and language. In this work, we show that advances in generative modeling, photorealistic rendering, and procedural generation allow us to tackle tasks requiring both. By generating contact-rich trajectories with accurate physics in semantically-diverse simulations, we can distill behaviors into large multimodal models that directly transfer to the real world: a system we call Proc4Gem. Specifically, we show that a foundation model, Gemini, fine-tuned on only simulation data, can be instructed in language to control a quadruped robot to push an object with its body to unseen targets in unseen real-world environments. Our real-world results demonstrate the promise of using simulation to imbue foundation models with physical agency. Videos can be found at our website: https://sites.google.com/view/proc4gem

Paper Structure

This paper contains 25 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Method overview: (1) We use a dataset of furniture objects and random procedural generation to sample living room scenes for training (Section \ref{sec:methods-sim}). (2) We use reinforcement learning to train a privileged trolley-pushing policy, and then collect rollouts from this policy while rendering high-resolution RGB images using Unity and generating captions for target objects using Gemini (Section \ref{sec:methods-task-experts}). (3) We use behavior cloning to train transformer-based policies on this vision-language-action data (Section \ref{sec:methods-distillation}). (4) We evaluate these policies in a held-out real-world scene (Section \ref{sec:results}).
  • Figure 2: Examples of procedurally-generated living room scenes. We use a hierarchical placement recipe to place assets in semantically-meaningful configurations.
  • Figure 3: Comparison between the real-world setup and the equivalent simulated scene with Unity rendering. This scene is not used for training data generation, only for evaluation.
  • Figure 4: Barkour robot.
  • Figure 5: Simulation results in procedurally-generated scenes.
  • ...and 7 more figures
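
The Figure 2 caption mentions a hierarchical placement recipe for arranging assets. As a rough illustration of that general idea only, the sketch below first samples positions for parent assets and then places child assets at a sampled distance from their parent. The asset names, distance ranges, room size, and the `Placement` and `sample_scene` names are all hypothetical and are not taken from the paper; the authors' actual recipe is not described in this summary.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Placement:
    """A placed asset with 2D floor coordinates (metres) and its child assets."""
    name: str
    x: float
    y: float
    children: list = field(default_factory=list)


# Hypothetical recipe (not from the paper): each parent asset lists its
# children with a (min, max) distance range from the parent, in metres.
RECIPE = {
    "sofa": [("coffee_table", (0.8, 1.2)), ("side_table", (1.0, 1.4))],
    "tv_stand": [("plant", (0.6, 1.0))],
}


def sample_scene(seed, room_size=5.0):
    """Sample one scene: parents first, then children relative to each parent.

    This toy version does no collision or room-boundary checking, so children
    may overlap other assets or land outside the room.
    """
    rng = random.Random(seed)  # seeded generator makes scenes reproducible
    scene = []
    for parent_name, child_specs in RECIPE.items():
        parent = Placement(
            parent_name,
            rng.uniform(0.0, room_size),
            rng.uniform(0.0, room_size),
        )
        for child_name, (lo, hi) in child_specs:
            dist = rng.uniform(lo, hi)
            angle = rng.uniform(0.0, 2.0 * math.pi)
            parent.children.append(
                Placement(
                    child_name,
                    parent.x + dist * math.cos(angle),
                    parent.y + dist * math.sin(angle),
                )
            )
        scene.append(parent)
    return scene


if __name__ == "__main__":
    for parent in sample_scene(seed=0):
        print(parent.name, [child.name for child in parent.children])
```

Calling `sample_scene` with different seeds yields different layouts that all respect the same parent-child structure, which is the property that makes this kind of recipe useful for generating many semantically plausible scenes.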