GenSim: Generating Robotic Simulation Tasks via Large Language Models

Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang

TL;DR

GenSim addresses the data-intensive challenge of training general robotic policies by automatically generating diverse task-level simulation environments and demonstrations with large language models. It introduces a three-component pipeline (Task Creator, Task Library, and language-conditioned multitask policy training) and supports two modes: goal-directed and exploratory. Using GPT-4, it expands a small human task set to over 100 tasks, improving in-domain and zero-shot generalization, and achieving substantial sim-to-real gains after modest adaptation. The work highlights reduced human effort, scalable task coverage, and improved transfer to real-world long-horizon tasks, advancing the use of foundation-model-driven approaches in robotics.

Abstract

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
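
The two generation modes described in the abstract can be illustrated with a minimal sketch. This is not GenSim's actual code: the names `TaskLibrary`, `build_prompt`, `generate`, and `stub_llm` are hypothetical, the prompts are placeholders, and the validity check stands in for the execution, critic, and human feedback the paper describes. The only point is the control flow: goal-directed mode conditions the prompt on a target task, while exploratory mode bootstraps from the tasks already in the library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical type: any callable that maps a prompt string to task code.
LLM = Callable[[str], str]


@dataclass
class TaskLibrary:
    """Cache of accepted task programs (illustrative, not GenSim's class)."""
    tasks: List[str] = field(default_factory=list)

    def add(self, code: str) -> None:
        self.tasks.append(code)


def build_prompt(library: TaskLibrary, target: Optional[str] = None) -> str:
    """Goal-directed mode names a target task; exploratory mode omits it."""
    context = "\n".join(library.tasks) or "(no tasks yet)"
    if target is not None:
        return f"Existing tasks:\n{context}\nPropose a task that helps solve: {target}"
    return f"Existing tasks:\n{context}\nPropose a novel task that differs from these."


def generate(
    llm: LLM,
    library: TaskLibrary,
    rounds: int,
    target: Optional[str] = None,
    is_valid: Callable[[str], bool] = lambda code: bool(code.strip()),
) -> TaskLibrary:
    """Query the LLM repeatedly and keep only new, valid task code."""
    for _ in range(rounds):
        code = llm(build_prompt(library, target))
        # Placeholder filter: a real pipeline would run the code in simulation.
        if is_valid(code) and code not in library.tasks:
            library.add(code)
    return library


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: names a task after the library size."""
    return f"class Task{prompt.count('class ')}: pass"


# Exploratory mode: no target, each round sees the tasks accepted so far.
library = generate(stub_llm, TaskLibrary(), rounds=3)
# Goal-directed mode: same loop, but the prompt carries a target task.
library = generate(stub_llm, library, rounds=1, target="build-tower")
```

Swapping `stub_llm` for a call to a real model and `is_valid` for a simulation-based check would be the natural next step; both are left abstract here because the source does not specify their interfaces.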

Paper Structure (33 sections, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Task gallery of over 100 tasks generated by GPT4. GenSim leverages an LLM code generation pipeline to scale up simulation tasks for policy training and task-level generalization.
  • Figure 2: GenSim is an LLM framework to scale up simulation task diversity for robotic policy training. We investigate goal-directed mode (top prompt) and exploratory mode (bottom prompt) that generate robotic simulation task codes. The generated task codes are cached in a task library which can be used for policy training to achieve better task-level generalization and sim-to-real adaptations.
  • Figure 3: Our automatic simulation task generation pipeline (top left) generates a task code that can be used to generate scenes, simulations, and expert demonstrations for imitation learning. In addition to the execution-based feedback common in LLM program synthesis tasks, the LLM critic and the task library provide task quality feedback. Finally, humans and single-policy training can provide the final feedback on the expert and learner rollouts without any extensive coding experience (see the appendix on human effort).
  • Figure 4: GenSim demonstrates interesting task-level composition and extrapolation behaviors in code generation for simulation tasks, which are distilled to policy learning through demonstrations.
  • Figure 5: The task library can be used for retrieval and finetuning in the GenSim pipeline. Moreover, task code embeddings can be used to create an embedding space in the task library (visualized as a t-SNE plot), which can be used for clustering tasks and policy training. For example, purple represents tasks involving rope, and blue denotes tasks that involve building structures.
  • ...and 9 more figures
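
The retrieval idea in the Figure 5 caption can be sketched in a few lines. The example below is a toy under stated assumptions: it uses a bag-of-words count vector as a stand-in for whatever code embedding the paper uses, and the library entries, task names, and the functions `embed`, `cosine`, and `retrieve` are all made up for illustration. It only shows how nearest-neighbor lookup in an embedding space surfaces related tasks.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(library: Dict[str, str], query: str, k: int = 2) -> List[str]:
    """Return the names of the k library tasks most similar to the query."""
    q = embed(query)
    ranked = sorted(
        library, key=lambda name: cosine(embed(library[name]), q), reverse=True
    )
    return ranked[:k]


# Hypothetical library: task name -> short description standing in for code.
library = {
    "align-rope": "move the rope ends to align the rope along a line",
    "build-tower": "stack blocks to build a tower structure",
    "build-bridge": "place blocks on two bases to build a bridge structure",
}

top = retrieve(library, "build a structure from blocks", k=2)
print(top)  # ['build-tower', 'build-bridge']
```

The two structure-building tasks rank above the rope task, mirroring the kind of grouping the caption describes; a clustering step over the same vectors would follow the same pattern.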