Statler: State-Maintaining Language Models for Embodied Reasoning

Takuma Yoneda, Jiading Fang, Peng Li, Huanyu Zhang, Tianchong Jiang, Shengjie Lin, Ben Picker, David Yunis, Hongyuan Mei, Matthew R. Walter

TL;DR

Statler introduces a state-maintaining approach to embodied reasoning in robotics that uses two prompted LLM instances, a world-state reader and a world-state writer, so that each action is conditioned on an explicitly tracked estimate of the latent world state. Framed as a model-based extension of Code-as-Policies, Statler outperforms baselines on simulated tabletop tasks and real-robot experiments, particularly for queries that require history-aware reasoning. Ablations show the value of separating the world-state reader from the writer and of maintaining an external state rather than relying on the LLM's implicit internal memory. The work suggests that the approach could scale to longer-horizon planning and provides a modular prompt design with extensive demonstrations for bootstrapping state maintenance in diverse domains.

Abstract

There has been significant research interest in employing large language models to empower intelligent robots with complex reasoning. Existing work focuses on harnessing their abilities to reason about the histories of their actions and observations. In this paper, we explore a new dimension in which large language models may benefit robotics planning. In particular, we propose Statler, a framework in which large language models are prompted to maintain an estimate of the world state, which is often unobservable, and to track its transitions as new actions are taken. Our framework then conditions each action on the estimate of the current world state. Despite being conceptually simple, our Statler framework significantly outperforms strong competing methods (e.g., Code-as-Policies) on several robot planning tasks. Additionally, it has the potential advantage of scaling up to more challenging long-horizon planning tasks.
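
The read-then-write cycle described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dictionary state format, the function names (`run_state_loop`, `toy_reader`, `toy_writer`), and the action strings are all hypothetical, and the two prompted LLMs are replaced by deterministic toy functions so the example runs without a model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical state format: object name -> status string.
State = Dict[str, str]


def run_state_loop(
    queries: List[str],
    state: State,
    reader: Callable[[State, str], str],
    writer: Callable[[State, str], State],
) -> Tuple[List[str], State]:
    """Condition each action on the current state estimate, then update it."""
    actions = []
    for query in queries:
        action = reader(state, query)  # read: (state, query) -> action
        state = writer(state, action)  # write: (state, action) -> next state
        actions.append(action)
    return actions, state


# Toy stand-ins for the two prompted LLMs (assumptions, not the paper's prompts).
def toy_reader(state: State, query: str) -> str:
    # Ignores the query text and picks the alphabetically first dirty object.
    dirty = sorted(name for name, status in state.items() if status == "dirty")
    return f"clean({dirty[0]})" if dirty else "noop()"


def toy_writer(state: State, action: str) -> State:
    # Copies the state and marks the cleaned object, if any, as clean.
    new_state = dict(state)
    if action.startswith("clean("):
        new_state[action[len("clean("):-1]] = "clean"
    return new_state


initial = {"red block": "dirty", "blue block": "clean", "green block": "dirty"}
actions, final = run_state_loop(
    ["clean a dirty block"] * 3, initial, toy_reader, toy_writer
)
```

Because the state is carried explicitly between steps, the third query yields a no-op: the toy reader can see that nothing is dirty any more without re-reading the action history.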

Paper Structure (17 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Our Statler framework enables robots to carry out complex tasks specified in natural language that require reasoning over long time horizons. Integral to our model are its world-state writer and world-state reader, two instances of general LLMs responsible for maintaining the explicit world state and generating code that enables the robot to carry out the task.
  • Figure 2: Model accuracies on the three-cups-and-a-ball shell game. LLM+State is a simplified version of our proposed Statler framework. For each method, the solid line shows how its accuracy $a(n)$ changes with the number of swaps $n$. The dashed line is the relative accuracy: $r(n) = a(n)/a(1)$. Intuitively, it measures how fast the performance decreases from a hypothetically perfect one-swap performance. Note that LLM+State indeed achieves $a(1)=100\%$.
  • Figure 3: Examples of simulations that show the result of executing different natural language instructions using Code-as-Policies and our state-maintaining Statler algorithm.
  • Figure 4: The simulated domains we consider include Pick-and-Place; Block Disinfection, where the translucent sphere around a block represents its dirtiness (this is not visible to the robot); and Relative Weight Reasoning, where the radius of the disk under each block indicates its weight (this is not visible to the robot).
  • Figure 5: Examples that show the result of querying LLMs with and without maintained state. In the first scenario, CaP fails to produce an answer, while our Statler model produces the correct response. In the second example, one block is not visible and CaP incorrectly identifies two blocks as not being a bowl. By maintaining a persistent world state, our method is aware of the third block and correctly answers the query.
  • ...and 1 more figure
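
The shell game and the relative-accuracy metric from the Figure 2 caption can be made concrete with a short sketch. The function names and the accuracy values below are assumptions for illustration only; the numbers are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def apply_swaps(ball_cup: int, swaps: List[Tuple[int, int]]) -> int:
    """Track the ball's cup index by updating the state after every swap."""
    for i, j in swaps:
        if ball_cup == i:
            ball_cup = j
        elif ball_cup == j:
            ball_cup = i
    return ball_cup


def relative_accuracy(accuracy: Dict[int, float]) -> Dict[int, float]:
    """Compute r(n) = a(n) / a(1) for every swap count n in the table."""
    base = accuracy[1]
    if base == 0:
        raise ValueError("a(1) must be non-zero")
    return {n: a / base for n, a in accuracy.items()}


# Ball starts under cup 0; swap cups 0 and 1, then cups 1 and 2.
final_cup = apply_swaps(0, [(0, 1), (1, 2)])

# Placeholder accuracies for illustration only (not results from the paper).
toy_accuracy = {1: 0.8, 2: 0.6, 3: 0.4}
toy_relative = relative_accuracy(toy_accuracy)
```

Dividing by $a(1)$ normalizes away a method's one-swap error, so $r(n)$ isolates how quickly accuracy degrades as the number of swaps grows.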