Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

TL;DR

Inspired by human reflective practitioners, Reflective Test-Time Planning integrates two modes of reflection: reflection-in-action, which uses test-time scaling to generate and score multiple candidate actions with internal reflections before execution, and reflection-on-action, which uses test-time training to update the internal reflection model and the action policy from external reflections after execution.

Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
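
The reflection-in-action step described in the abstract (sample several candidate actions, score each with an internal reflection, execute the best) can be sketched as a best-of-N selection. This is a minimal illustration, not the authors' implementation: `reflect_in_action`, `toy_propose`, `toy_score`, and `TOY_SCORES` are hypothetical names, the two callables stand in for the action LLM and the internal reflection LLM, and the defaults of 6 candidates and temperature 1.25 are taken from the values the Figure 6 caption reports as working well on Cupboard Fitting.

```python
import random
from typing import Callable, List, Tuple

ScoredAction = Tuple[str, float]


def reflect_in_action(
    propose: Callable[[str, float], str],   # stand-in for the action LLM (assumption)
    score: Callable[[str, str], float],     # stand-in for the internal reflection LLM (assumption)
    observation: str,
    num_candidates: int = 6,
    temperature: float = 1.25,
) -> Tuple[str, List[ScoredAction]]:
    """Sample candidate actions, score each one, and return the top-scoring action."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [propose(observation, temperature) for _ in range(num_candidates)]
    scored = [(action, score(observation, action)) for action in candidates]
    best_action, _ = max(scored, key=lambda pair: pair[1])
    return best_action, scored


# Toy stand-ins so the sketch runs without any model; scores are made up.
TOY_SCORES = {"open cupboard": 0.9, "push box": 0.2, "wait": 0.1}


def toy_propose(observation: str, temperature: float) -> str:
    # A real policy would sample from an LLM at this temperature.
    return random.choice(list(TOY_SCORES))


def toy_score(observation: str, action: str) -> float:
    # A real scorer would parse a score out of a verbal reflection.
    return TOY_SCORES[action]


best, scored = reflect_in_action(toy_propose, toy_score, "box in front of cupboard")
print(best, scored)
```

Passing the proposer and scorer in as callables keeps the selection logic independent of any particular model API, which is why no model library is imported here.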

Paper Structure

This paper contains 47 sections, 19 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 2: Method overview. (a) Reflection-in-action: multiple candidate actions are generated and scored by an internal reflection LLM prior to execution. (b) Reflection-on-action: invoked iteratively, whenever the working memory reaches size K or the agent hits a key milestone. Executed actions are critiqued by an external reflection LLM and stored in a working memory buffer; at milestones, hindsight re-evaluation assigns long-horizon credit. The resulting verbal reflections form self-supervised training data to update both the internal reflection LLM (supervised loss) and the action LLM (policy gradient) via test-time training, enabling agents to learn from execution experience during deployment.
  • Figure 3: Cupboard Fitting results. Blue bars show correct-placement rate; pink bars show fit rate. RIA denotes reflection-in-action; ROA denotes reflection-on-action. "W/o external reflection" means that external reflection is not given as input to the action-generation LLM. We implement two test-time training variants for ROA: one updates all base weights, and the other updates LoRA parameters only. Reflective Test-Time Planning significantly improves both success metrics.
  • Figure 4: Examples of the Cupboard Fitting Task.
  • Figure 5: Qualitative Examples. Steps and reflections are simplified for presentation. Blue text shows internal reflection used for candidate selection, orange text shows external reflection after execution, and red text shows retrospective reflection. (a) Long-Horizon Household example. We label it "retro & internal" because the generated retrospective reflection is also used to train the internal model. (b) Real-robot Cupboard Fitting example. For simplicity, we omit the detailed reflections and show only the reflection scores, in brackets.
  • Figure 6: Hyperparameter ablation studies on Cupboard Fitting. Top Left: Performance vs. number of candidate actions. Peak performance (60.0%) occurs at N=6 candidates, demonstrating that internal reflection effectively identifies superior actions from diverse pools. Beyond N=6, performance plateaus as excessive candidates add computational cost without improving the best candidate quality. Top Right: Performance vs. sampling temperature. Optimal temperature range (T=1.25-1.5) balances candidate diversity with quality: temperatures below 0.5 produce overly similar candidates that limit reflection value, while temperatures above 1.75 generate incoherent actions that even accurate reflection cannot salvage. Bottom Left: Performance vs. LoRA configuration (rank, alpha). The optimal configuration (r=8, alpha=16) achieves 60.0% performance, balancing adaptation capacity with training stability. Smaller configurations like (4,4) underfit with insufficient capacity (52.5%), while larger configurations cause mode collapse during test-time training: (16,32) drops to 41.5% and (32,32) collapses to 34.8% as the model begins predicting identical outputs for all inputs, losing the ability to distinguish between different spatial configurations and task contexts. Bottom Right: Performance vs. action budget (maximum steps). Performance improves dramatically from 30 steps (51.5%) to 50 steps (60.0%), but slightly degrades to 59.4% at 100 steps, suggesting that excessive action budgets allow suboptimal exploration strategies that accumulate errors over longer horizons.
  • ...and 2 more figures
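
The trigger rule in the Figure 2 caption (reflection-on-action runs when working memory reaches size K or at a key milestone) can be illustrated with a small buffer that hands back a batch of reflections whenever either condition holds. This is a sketch under stated assumptions, not the paper's code: `Reflection`, `WorkingMemory`, and their fields are hypothetical names, the capacity of 3 is an arbitrary demo value since the caption gives no value for K, and the hindsight re-evaluation and the actual test-time training update are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reflection:
    action: str      # the executed action
    critique: str    # verbal feedback from the external reflection step
    score: float     # scalar judgment attached to the critique (assumed field)


@dataclass
class WorkingMemory:
    capacity: int    # "K" in the Figure 2 caption; no value is given there
    entries: List[Reflection] = field(default_factory=list)

    def add(self, reflection: Reflection, milestone: bool = False) -> List[Reflection]:
        """Store a reflection; return a training batch if a trigger fires, else []."""
        self.entries.append(reflection)
        if milestone or len(self.entries) >= self.capacity:
            batch, self.entries = self.entries, []   # flush the buffer
            return batch
        return []


# Demo: capacity 3, with a milestone declared at the last of five steps.
memory = WorkingMemory(capacity=3)
batches = []
for step in range(5):
    reflection = Reflection(f"action-{step}", "placeholder critique", 0.5)
    batch = memory.add(reflection, milestone=(step == 4))
    if batch:
        # A full system would run a test-time training update on this batch.
        batches.append([r.action for r in batch])

print(batches)
# [['action-0', 'action-1', 'action-2'], ['action-3', 'action-4']]
```

The first batch is released by the capacity trigger and the second by the milestone trigger, so the buffer is empty at the end of the demo.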