FactorSim: Generative Simulation via Factorized Representation

Fan-Yun Sun, S. I. Harini, Angela Yi, Yihan Zhou, Alex Zook, Jonathan Tremblay, Logan Cross, Jiajun Wu, Nick Haber

TL;DR

This work introduces FACTORSIM, a method that generates full simulations in code from language input. The generated simulations can be used to train agents, and FACTORSIM outperforms existing methods on prompt alignment, zero-shot transfer, and human evaluation.

Abstract

Generating simulations to train intelligent agents in game-playing and robotics from natural language input, such as user input or task documentation, remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM, which generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code's accuracy and its effectiveness in facilitating zero-shot transfer in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations with respect to prompt alignment (e.g., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks.

Paper Structure (22 sections, 6 equations, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of FactorSim. FactorSim takes language documentation as input, uses Chain-of-Thought to derive a series of steps to be implemented, adopts a Factored POMDP representation to facilitate efficient context selection during each generation step, trains agents on the generated simulations, and tests the resulting policy on previously unseen RL environments.
  • Figure 2: An illustrative example of how the five main prompts in FactorSim correspond to our formulation in Algorithm 1. Note that the function red_puck_respawn is retrieved as part of the context for Prompts 3, 4, and 5 because it modifies red_puck_position, a state variable that the LLM identified as relevant in Prompt 2.
  • Figure 3: Performance and token usage analysis of GPT-4-based methods. Ellipses correspond to 90% confidence intervals for each algorithm, aggregated over all RL games.
  • Figure 4: Zero-shot transfer results on previously unseen environments (i.e., environments in the original RL benchmark, the PyGame Learning Environment (Tasfi, 2016)).
  • Figure 5: Human evaluation results on the generated simulations of FactorSim and the strongest baseline (i.e., GPT-4 CoT w/ self-debug), aggregated over all 8 RL games.
  • ...and 5 more figures
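
The context selection described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the paper's implementation: the SimFunction record, the select_context helper, and the functions update_score and move_player are hypothetical names introduced here. Only red_puck_respawn and red_puck_position come from the caption. The sketch assumes each generated function is stored together with the set of state variables it modifies, so that a generation step retrieves only the functions touching the variables judged relevant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimFunction:
    """Hypothetical record: a generated function and the state variables it modifies."""
    name: str
    modifies: frozenset


def select_context(functions, relevant_vars):
    """Keep only the functions that modify at least one relevant state variable."""
    relevant = set(relevant_vars)
    return [f for f in functions if f.modifies & relevant]


# red_puck_respawn and red_puck_position are taken from the Figure 2 caption;
# the other two functions are made up for illustration.
functions = [
    SimFunction("red_puck_respawn", frozenset({"red_puck_position"})),
    SimFunction("update_score", frozenset({"score"})),
    SimFunction("move_player", frozenset({"player_position"})),
]

# Suppose an earlier step identified red_puck_position as the relevant variable.
context = select_context(functions, {"red_puck_position"})
print([f.name for f in context])
```

Under these assumptions, the context passed to later prompts contains red_puck_respawn only, which is the sense in which a factored representation reduces how much existing code each generation step has to see.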

Theorems & Definitions (1)

  • Definition 3.1