SIMA 2: A Generalist Embodied Agent for Virtual Worlds

SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek, Andrew Bolt, Andrew Lampinen, Anna Mitenkova, Arne Olav Hallingstad, Bojan Vujatovic, Bonnie Li, Cong Lu, Daan Wierstra, Daniel P. Sawyer, Daniel Slater, David Reichert, Davide Vercelli, Demis Hassabis, Drew A. Hudson, Duncan Williams, Ed Hirst, Fabio Pardo, Felix Hill, Frederic Besse, Hannah Openshaw, Harris Chan, Hubert Soyer, Jane X. Wang, Jeff Clune, John Agapiou, John Reid, Joseph Marino, Junkyung Kim, Karol Gregor, Kaustubh Sridhar, Kay McKinney, Laura Kampis, Lei M. Zhang, Loic Matthey, Luyu Wang, Maria Abi Raad, Maria Loks-Thompson, Martin Engelcke, Matija Kecman, Matthew Jackson, Maxime Gazeau, Ollie Purkiss, Oscar Knagg, Peter Stys, Piermaria Mendolicchio, Raia Hadsell, Rosemary Ke, Ryan Faulkner, Sarah Chakera, Satinder Singh Baveja, Shane Legg, Sheleem Kashem, Tayfun Terzi, Thomas Keck, Tim Harley, Tim Scholtes, Tyson Roberts, Volodymyr Mnih, Yulan Liu, Zhengdong Wang, Zoubin Ghahramani

TL;DR

SIMA 2 presents a Gemini-based embodied agent capable of reasoning, dialogue, and action across diverse 3D virtual worlds, advancing beyond SIMA 1 with goal-directed planning and multi-modal instruction. By integrating Gemini’s reasoning with a perception-action loop and a mixed data training regime, SIMA 2 achieves substantial task success, generalizes to held-out environments, and demonstrates open-ended self-improvement via a Gemini task setter and reward model. The work highlights strong generalization to photorealistic environments generated by Genie 3 and shows how embodied agents can be guided and improved with minimal human input, signaling a pathway toward more capable, lifelong-learning agents that could transfer to real-world robotics. It also discusses limitations and responsible development, outlining future directions for longer-horizon reasoning, memory, and robust low-level control in complex 3D worlds.

Abstract

We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.
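The self-improvement loop mentioned in the abstract (Gemini proposes tasks, the agent attempts them, Gemini scores the attempts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `propose_task`, `run_agent`, and `score_trajectory` are hypothetical stand-ins rather than real Gemini or SIMA 2 interfaces, and the training update is replaced by a toy scalar `skill` value instead of an actual model update.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    trajectory: list
    reward: float


def propose_task(rng):
    """Hypothetical stand-in for the Gemini task setter."""
    return rng.choice(["collect wood", "open the door", "climb the hill"])


def run_agent(task, skill, rng):
    """Hypothetical stand-in for an agent rollout; returns a toy trajectory."""
    succeeded = rng.random() < skill
    return [f"attempt: {task}", "success" if succeeded else "failure"]


def score_trajectory(task, trajectory):
    """Hypothetical stand-in for the Gemini reward model."""
    return 1.0 if trajectory[-1] == "success" else 0.0


def self_improve(num_rounds, episodes_per_round, seed=0):
    """Run the propose / act / score loop and return the skill after each round."""
    rng = random.Random(seed)
    skill = 0.1  # toy proxy for agent competence (assumption)
    history = []
    for _ in range(num_rounds):
        episodes = []
        for _ in range(episodes_per_round):
            task = propose_task(rng)
            trajectory = run_agent(task, skill, rng)
            reward = score_trajectory(task, trajectory)
            episodes.append(Episode(task, trajectory, reward))
        # Keep only rewarded episodes as new "training data".
        kept = [e for e in episodes if e.reward > 0]
        # Toy update: skill grows with the amount of rewarded experience.
        skill = min(1.0, skill + 0.05 * len(kept))
        history.append(skill)
    return history


if __name__ == "__main__":
    print(self_improve(num_rounds=5, episodes_per_round=10))
```

The point of the sketch is the data flow: no human labels enter the loop, and the only learning signal comes from the scoring function applied to self-generated experience.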

Paper Structure

This paper contains 54 sections, 21 figures, 2 tables.

Figures (21)

  • Figure 1: SIMA 2 is a Gemini-based agent that reasons, acts, and engages in dialogue across diverse embodied 3D virtual worlds. In the top left panel, we see an example of the agent responding to the user in No Man's Sky. As compared with SIMA 1, SIMA 2 is a step-change improvement in embodied performance, and it is even capable of self-improving in previously unseen environments.
  • Figure 2: Environments. The grid shows a sampling of images across the video game environments used to train and evaluate SIMA 2. Due to the complexity of open-world commercial video games, agents must handle a near-limitless variety of 3D configurations, menus, and underlying environment dynamics. This provides an ideal setting to develop and test embodied agents. By acquiring general embodiment capabilities in these environments, SIMA 2 is able to generalize in non-trivial ways to entirely new environments, including photorealistic environments generated by Genie 3.
  • Figure 3: Agent-Environment Interface. The agent receives a prompt that includes the current instruction. Conditioning on recent frames, the agent outputs internal reasoning, dialogue, and actions, with the agent specifying which modalities to produce at any given step.
  • Figure 4: Embodied Dialogue & Basic Reasoning. SIMA 2 contains a variety of new capabilities, including embodied dialogue and basic reasoning. Above, SIMA 2 answers a user's question through embodied interaction. Below, the agent correctly reasons that it needs to go to a red house based on the user's instruction. These new capabilities are unlocked by using Gemini within SIMA 2.
  • Figure 5: Complex Instructions & Multi-modal Prompting. By inheriting Gemini's language understanding capabilities, SIMA 2 can handle a variety of novel, complex instructions, including breaking down instructions to successfully navigate to a specific room. SIMA 2 can also be prompted with images, including sketches, to specify locations, paths, or objects.
  • ...and 16 more figures
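
The agent-environment interface described in the Figure 3 caption (an instruction in the prompt, conditioning on recent frames, and optional reasoning, dialogue, and action outputs) can be sketched roughly as follows. This is an illustrative toy, not the SIMA 2 implementation: the class names, the `context_frames` parameter, the action strings, and the rule for when each modality is produced are all assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentOutput:
    # Each modality is optional; None means the agent chose not to emit it.
    reasoning: Optional[str] = None
    dialogue: Optional[str] = None
    actions: Optional[list] = None


class ToyAgent:
    """Hypothetical agent that keeps a short window of recent frames."""

    def __init__(self, instruction, context_frames=4):
        self.instruction = instruction
        self.frames = deque(maxlen=context_frames)  # oldest frames drop out
        self.steps = 0

    def step(self, frame):
        self.frames.append(frame)
        self.steps += 1
        out = AgentOutput()
        if self.steps == 1:
            # Toy rule (assumption): reason and reply only on the first step.
            out.reasoning = f"Goal: {self.instruction}"
            out.dialogue = "On it."
        # Toy rule (assumption): always emit one placeholder action.
        out.actions = ["move_forward"]
        return out


if __name__ == "__main__":
    agent = ToyAgent("go to the red house", context_frames=2)
    for frame in ["frame0", "frame1", "frame2"]:
        print(agent.step(frame))
```

Using optional fields for each modality mirrors the caption's point that the agent itself specifies which outputs to produce at a given step.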