Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

Yuxiao Yang, Shenao Zhang, Zhihan Liu, Huaxiu Yao, Zhaoran Wang

TL;DR

This work reframes Embodied Instruction Following (EIF) as a Partially Observable Markov Decision Process (POMDP) and presents a closed-loop Hindsight Planner that operates effectively under few-shot large language model (LLM) reasoning. The approach integrates an adaptation module to infer latent state, a long-horizon actor–critic planner (RAFA), and a novel hindsight relabeling mechanism to leverage suboptimal trajectories during training and deployment. Key contributions include: (1) a POMDP-centric planning framework for EIF, (2) a hindsight prompting strategy that preserves task distributions while enriching learning signals, and (3) demonstrated state-of-the-art few-shot performance on ALFRED, approaching or surpassing some full-shot supervised baselines. The results indicate substantial robustness gains in long-horizon tasks and highlight the practical potential of combining LLM-based reasoning with structured planning under partial observability.

Abstract

This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs). Previous works typically train a planner to imitate expert trajectories, treating this as a supervised task. While these methods achieve competitive performance, they often lack sufficient robustness. When a suboptimal action is taken, the planner may encounter an out-of-distribution state, which can lead to task failure. In contrast, we frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption. Thus, we propose a closed-loop planner with an adaptation module and a novel hindsight method, aiming to use as much information as possible to assist the planner. Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption. For the first time, our few-shot agent's performance approaches and even surpasses that of the full-shot supervised agent.

Paper Structure (23 sections, 6 equations, 3 figures, 5 tables, 3 algorithms)

Figures (3)

  • Figure 1: Left: Overview of the Hindsight Planner. At each time step $t$, the planner receives a partial observation $y^t$ from the environment. The adaptation module estimates the latent variable and concatenates it with $y^t$ to produce the complete state $x_t$. $\texttt{Actor}_\texttt{hind}$ and $\texttt{Actor}_\texttt{gt}$ are prompted with different samples and each propose actions, which the Critic evaluates. The best rollout $(x_t,a_t^*,x_{t+1}^*,a_{t+1}^*\ldots)$ is selected, and $a_t^*$ is returned. Right: An example of the relabeling process for $\texttt{Actor}_\texttt{hind}$. After a suboptimal rollout is collected, the LLM is prompted to reflect on the actions already taken and is then prompted to complete the suboptimal rollout.
  • Figure 2: A comparison of Hindsight Planner and previous supervised methods when taking a suboptimal action. The agent initially picks up the incorrect object ("Basketball"). In the supervised method, the planner fails to handle this situation, which leads to task failure. In contrast, the Hindsight Planner can adjust after the incorrect action and successfully complete the task.
  • Figure 3: A full run of the Hindsight Planner. At the start of the task, "Place a plate with a ladle on it in a cabinet," the Adapter mistakenly identifies the task as picking up a plate and placing it into a cabinet. $\texttt{Actor}_\texttt{hind}$ and $\texttt{Actor}_\texttt{gt}$ make decisions separately, and the Critic then selects the best action as the output. Upon further exploration, the agent detects more objects, and the Adapter revises its output, recognizing the task as stacking a ladle onto a plate and then placing both into a cabinet. The Actors and the Critic subsequently make decisions based on the revised prediction.
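
The planning step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ClosedLoopPlanner`, its fields, the string-based state, and the toy stand-ins for the adapter, actors, critic, and transition model are all hypothetical, and in the paper these components are prompted LLMs rather than hand-written functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

State = str
Action = str
Rollout = List[Tuple[State, Action]]


@dataclass
class ClosedLoopPlanner:
    """Hypothetical sketch of one planning step (all names are assumptions)."""
    adapt: Callable[[str], str]                        # observation -> latent estimate
    actors: Sequence[Callable[[State], List[Action]]]  # e.g. hindsight and ground-truth actors
    critic: Callable[[Rollout], float]                 # scores a whole rollout
    model: Callable[[State, Action], State]            # assumed next-state predictor
    depth: int = 2                                     # lookahead horizon

    def expand(self, state: State, depth: int) -> List[Rollout]:
        """Enumerate rollouts of up to `depth` steps from proposals of every actor."""
        if depth == 0:
            return [[]]
        rollouts: List[Rollout] = []
        for actor in self.actors:
            for action in actor(state):
                next_state = self.model(state, action)
                for tail in self.expand(next_state, depth - 1):
                    rollouts.append([(state, action)] + tail)
        # If no actor proposes anything, end the rollout here.
        return rollouts or [[]]

    def step(self, observation: str) -> Action:
        """Build the full state, pick the best rollout, return its first action."""
        latent = self.adapt(observation)
        state = f"{observation} | {latent}"  # concatenate observation and latent
        best = max(self.expand(state, self.depth), key=self.critic)
        if not best:
            raise ValueError("no actor proposed an action")
        return best[0][1]


# Toy stand-ins, loosely mirroring the wrong-object scenario in Figure 2.
def toy_adapt(observation: str) -> str:
    return "task: put Apple in Fridge"

def toy_actor_gt(state: State) -> List[Action]:
    return ["PutObject Fridge"] if "holding" in state else ["PickupObject Apple"]

def toy_actor_hind(state: State) -> List[Action]:
    return ["PutObject Fridge"] if "holding" in state else ["PickupObject Basketball"]

def toy_model(state: State, action: Action) -> State:
    if action.startswith("Pickup"):
        return f"{state} ; holding {action.split()[-1]}"
    return f"{state} ; done"

def toy_critic(rollout: Rollout) -> float:
    # Reward rollouts whose actions involve the target object.
    return float(sum("Apple" in action for _, action in rollout))


planner = ClosedLoopPlanner(
    adapt=toy_adapt,
    actors=[toy_actor_gt, toy_actor_hind],
    critic=toy_critic,
    model=toy_model,
)
first_action = planner.step("sees Apple, Basketball")
```

Because only the first action of the best rollout is executed and the loop then restarts from a fresh observation, a mistaken estimate from the adapter (as in Figure 3) can be corrected at a later step instead of committing the agent to a fixed plan.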