Proc4Gem: Foundation models for physical agency through procedural generation
Yixin Lin, Jan Humplik, Sandy H. Huang, Leonard Hasenclever, Francesco Romano, Stefano Saliceti, Daniel Zheng, Jose Enrique Chen, Catarina Barros, Adrian Collister, Matt Young, Adil Dostmohamed, Ben Moran, Ken Caluwaerts, Marissa Giustina, Joss Moore, Kieran Connell, Francesco Nori, Nicolas Heess, Steven Bohez, Arunkumar Byravan
TL;DR
This work addresses the challenge of combining semantic understanding with physical contact dynamics in robot learning by pairing photorealistic, procedurally generated simulations with large multimodal foundation models. It introduces Proc4Gem, a pipeline that trains a language-conditioned, whole-body control policy by fine-tuning Gemini on simulation data, then deploys it on a real quadruped that pushes objects toward unseen targets in unseen environments. The results show effective sim2real transfer: on hardware, the fine-tuned Gemini policy outperforms a strong baseline, especially under hard or out-of-distribution conditions, and is robust to language variation and scene diversity. These findings highlight the potential of realistic simulation as a scalable source of task-focused data for grounding foundation models in physical agency, with implications for end-to-end embodied AI in real-world robotics. Future directions include leveraging larger context windows, improving visual generation, and moving beyond behavior cloning toward reinforcement learning in simulation.
Abstract
In robot learning, it is common to either ignore the environment semantics, focusing on tasks like whole-body control which only require reasoning about robot-environment contacts, or conversely to ignore contact dynamics, focusing on grounding high-level movement in vision and language. In this work, we show that advances in generative modeling, photorealistic rendering, and procedural generation allow us to tackle tasks requiring both. By generating contact-rich trajectories with accurate physics in semantically diverse simulations, we can distill behaviors into large multimodal models that directly transfer to the real world: a system we call Proc4Gem. Specifically, we show that a foundation model, Gemini, fine-tuned on only simulation data, can be instructed in language to control a quadruped robot to push an object with its body to unseen targets in unseen real-world environments. Our real-world results demonstrate the promise of using simulation to imbue foundation models with physical agency. Videos can be found at our website: https://sites.google.com/view/proc4gem
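
As a rough illustration of what distilling simulated trajectories into a language-conditioned model could involve, the sketch below turns one simulated episode into supervised (instruction, observation, action) examples of the kind used for behavior cloning. It is not the Proc4Gem implementation: the `Step` and `Example` records, the `format_action` text encoding, and the example instruction are all hypothetical, and the abstract does not specify how actions or images are actually represented.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Step:
    """One simulated timestep (hypothetical record layout)."""
    image_id: str            # placeholder for a rendered camera frame
    action: Sequence[float]  # hypothetical low-dimensional robot command


@dataclass
class Example:
    """One supervised fine-tuning example (hypothetical layout)."""
    prompt: str    # language instruction given to the model
    image_id: str  # observation paired with the instruction
    target: str    # action rendered as text for the model to predict


def format_action(action: Sequence[float], precision: int = 2) -> str:
    # Assumption: actions are serialized as space-separated decimals.
    return " ".join(f"{a:.{precision}f}" for a in action)


def trajectory_to_examples(instruction: str, steps: List[Step]) -> List[Example]:
    # Behavior cloning treats every (observation, action) pair from a
    # demonstration as an independent supervised example.
    return [
        Example(
            prompt=f"Instruction: {instruction}",
            image_id=step.image_id,
            target=format_action(step.action),
        )
        for step in steps
    ]


# Hypothetical two-step episode from a simulated scene.
steps = [
    Step(image_id="frame_000", action=[0.5, 0.0, -0.25]),
    Step(image_id="frame_001", action=[0.4, 0.1, 0.0]),
]
examples = trajectory_to_examples("push the object to the target", steps)
```

Under these assumptions, each simulated episode yields one training example per timestep, so the amount of fine-tuning data grows with the number of procedurally generated scenes and the episode length.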
