Grounding Large Language Models In Embodied Environment With Imperfect World Models

Haolan Liu; Jishen Zhao

TL;DR

GLIMO (Grounding Large language model with Imperfect world MOdel) uses proxy world models such as simulators to collect and synthesize training data, and improves the performance of strong open-source LLMs such as LLaMA-3.

Abstract

Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, because they lack direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of $2.04\times$, $1.54\times$, and $1.82\times$ across three different benchmarks, respectively. The resulting models compete with or surpass larger counterparts such as GPT-4.

Paper Structure (51 sections, 3 equations, 9 figures, 2 tables)

Introduction
Background and Related Work
Robot Learning With LLMs.
World Model.
Language Model Grounding.
Methodology
Problem Formulation
Perfect & Imperfect Environments.
Generating Grounded Embodied Experiences With LLM Agents
Agent Integration.
Reasoning.
Retrieval-Augmented Experiences
Annotation Generation
Filtering.
Training.
...and 36 more sections

Figures (9)

Figure 1: Overview of GLIMO (Grounding Large language model with Imperfect world MOdel). Compared with previous approaches (Xiang2023LanguageMM, zeng2023agenttuning) that rely on human annotation or downstream task design, GLIMO introduces an LLM agent-based framework that autonomously navigates the imperfect world model and annotates the embodied experiences. It improves data quality with a self-refining prompting mechanism and retrieval-augmented generation, and it enhances policy exploration in favor of data diversity.
Figure 2: We collect imperfect experiences from the imperfect world model on the left and evaluate in the environment on the right. (1) AgentWorld is a strategic 2D decision game, where players need to learn to improve themselves. The left figure (imperfect) is a simplified and steerable version with a smaller size and different elements and tools. (2) The Urban-Driving environment is based on the CARLA simulator (DBLP:journals/corr/abs-1711-03938). We also build a simplified (imperfect) autonomous driving environment, shown on the left.
Figure 3: (1) The left side shows the LLM agent-based pipeline, which guides the sampling of embodied experiences and transforms these experiences into formats suitable for instruction tuning. In each iteration, the LLM agent executes one plan and evaluates whether the goal is achieved. The LLM also compares the current plan with past experiences and annotates them over different numbers of steps. (2) The right side shows our prompt templates, which are used to (i) self-refine with past experiences, (ii) generate a rationale (or explanation) for decisions, enabling chain-of-thought finetuning (lightman2023letsverifystepstep), and (iii) favor exploration and enhance the diversity of our data. These templates help our agents produce high-quality and diverse embodied experiences.
Figure 4: This figure illustrates retrieval-augmented generation involving reasoning about the feasibility of different high-level plans, such as: "Which plan is better, challenge the boss directly or get the hammer first?" This approach allows us to reflect on prior experiences, determine which plan is better, and generate reasoning data, which can be further transformed into QA datasets with varying formats, as introduced in \ref{subsec:annotation}.
Figure 5: We compare the performance (success rate) of our framework with varied ratios of imperfect training data, ranging from 0 to 1. (1) The left figure shows performance in the AgentWorld environment. (2) The right figure shows performance in the Urban-Driving environment, where we also evaluate generalization capability in OOD scenarios.
...and 4 more figures
