In-Context Learning Enables Robot Action Prediction in LLMs

Yida Yin, Zekai Wang, Yuvan Sharma, Dantong Niu, Trevor Darrell, Roei Herzig

TL;DR

RoboPrompt tackles the problem of predicting robot actions from observations without training by leveraging in-context learning in off-the-shelf text-only LLMs. It builds ICL demonstrations from textual encodings of discretized 6-DoF poses and observed actions, derived from keyframes and a pose estimator, and uses a structured prompt to induce action predictions at test time. Across RLBench simulations and real-robot experiments, RoboPrompt outperforms zero-shot and some ICL baselines, while remaining competitive with supervised methods on simpler tasks and highlighting the potential of LLM-driven robotics with minimal data requirements. The work demonstrates that careful demonstration selection, pose-text representations, and prompt structuring can transfer LLM reasoning capabilities to direct 6-DoF action prediction, offering a data-efficient alternative for manipulation in static environments.
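
To make the "textual encodings of discretized 6-DoF poses" concrete, here is a minimal sketch of one way such an encoding could work. The bin count, the workspace bounds, the Euler-angle convention, and the names `discretize` and `pose_to_text` are all assumptions made for illustration; this summary does not state the exact scheme RoboPrompt uses.

```python
from typing import Sequence


def discretize(value: float, low: float, high: float, bins: int = 100) -> int:
    """Map a continuous value in [low, high] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    # The upper bound itself would land in bin `bins`, so clamp it.
    return min(index, bins - 1)


def pose_to_text(
    position: Sequence[float],
    euler_deg: Sequence[float],
    workspace_low: Sequence[float],
    workspace_high: Sequence[float],
    bins: int = 100,
) -> str:
    """Encode a 6-DoF pose (x, y, z, roll, pitch, yaw) as a short integer list.

    Hypothetical encoding: positions are binned over the workspace bounds and
    angles are wrapped to [0, 360) degrees before binning.
    """
    pos_bins = [
        discretize(p, lo, hi, bins)
        for p, lo, hi in zip(position, workspace_low, workspace_high)
    ]
    rot_bins = [discretize(a % 360.0, 0.0, 360.0, bins) for a in euler_deg]
    return "[" + ", ".join(str(b) for b in pos_bins + rot_bins) + "]"


text = pose_to_text(
    position=(0.0, 0.5, 1.0),
    euler_deg=(0.0, 90.0, 180.0),
    workspace_low=(0.0, 0.0, 0.0),
    workspace_high=(1.0, 1.0, 1.0),
)
print(text)  # [0, 50, 99, 0, 25, 50]
```

Small integers keep the prompt short and give a text-only LLM a compact, regular vocabulary to imitate, which is the usual motivation for discretizing before serializing.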

Abstract

Large Language Models (LLMs) have recently achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities of LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RoboPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes and estimate the initial object poses; both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, we show that RoboPrompt outperforms zero-shot and ICL baselines in simulated and real-world settings. Our project page is available at https://davidyyd.github.io/roboprompt.
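
The last step described in the abstract, assembling ICL demonstrations and a test query into one structured prompt, can be sketched as follows. The template wording (`Task:`, `Objects:`, `Actions:`), the dictionary layout, and the function names are hypothetical; the abstract does not give the actual template, so this only illustrates the general shape of such a prompt.

```python
from typing import Dict, List

# Hypothetical demonstration layout: discretized object poses plus the
# sequence of discretized keyframe actions observed in that episode.
Demo = Dict[str, object]


def format_objects(objects: Dict[str, List[int]]) -> str:
    """Serialize object names and their discretized poses as text."""
    return "{" + ", ".join(f'"{name}": {pose}' for name, pose in objects.items()) + "}"


def build_prompt(
    instruction: str,
    demos: List[Demo],
    test_objects: Dict[str, List[int]],
) -> str:
    """Concatenate the instruction, the demonstrations, and the open test query."""
    lines = [f"Task: {instruction}"]
    for demo in demos:
        lines.append(f"Objects: {format_objects(demo['objects'])}")
        lines.append(f"Actions: {demo['actions']}")
    # The test sample repeats the same pattern but leaves the actions blank,
    # so the LLM's continuation is read as the predicted action sequence.
    lines.append(f"Objects: {format_objects(test_objects)}")
    lines.append("Actions:")
    return "\n".join(lines)


demos = [
    {
        "objects": {"jar": [52, 40, 17, 0, 0, 25]},
        "actions": [[52, 40, 30, 0, 50, 25, 1], [52, 40, 18, 0, 50, 25, 0]],
    },
]
prompt = build_prompt("close the jar", demos, {"jar": [61, 33, 17, 0, 0, 75]})
print(prompt)
```

Because every demonstration follows the same fixed pattern, the model only has to continue the final `Actions:` line; the returned text would then be parsed back into numeric actions, a step omitted here.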

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of RoboPrompt. We introduce a novel framework that enables an off-the-shelf text-only LLM to directly predict robot actions through in-context learning (ICL) examples without any additional training. Our method first identifies keyframes where critical robot actions occur. We next estimate initial object poses and extract robot actions from keyframes, and both are converted into textual descriptions. Using this textual information along with the given instruction, we construct a structured prompt as ICL demonstrations, enabling the LLM to predict robot actions directly for an unseen test sample.
  • Figure 2: Visualization of the first few predicted actions from RoboPrompt. Each predicted action captures an important moment in a task. With the estimated object poses, the gripper's orientation closely aligns with that of each object.
  • Figure 3: Ablations on RoboPrompt. We demonstrate that (a) RoboPrompt with keyframe extraction outperforms uniform sampling at different intervals; (b) RoboPrompt's performance improves as the number of ICL examples increases; and (c) RoboPrompt can achieve high success rates under moderate levels of pose estimation noise.
  • Figure 4: Additional experiments on RoboPrompt. We demonstrate that (a) RoboPrompt with origin action performs better than RoboPrompt with action tokens; (b) open-loop RoboPrompt does not boost performance by a large margin; and (c) RoboPrompt achieves consistently high accuracy across different system prompts (light orange shows the standard deviation across prompts).
  • Figure 5: Setup. The real-robot setup with a Franka Emika Panda used for evaluating RoboPrompt.
  • ...and 4 more figures
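
The comparison in Figure 3(a) between keyframe extraction and uniform sampling can be illustrated with a small sketch. The text above says only that keyframes are identified heuristically, so the rule below (a gripper state change, the arm coming to rest, or the final step) is an assumed stand-in rather than the paper's exact criterion; `find_keyframes`, `uniform_samples`, and `speed_eps` are hypothetical names.

```python
from typing import List, Sequence


def find_keyframes(
    gripper_open: Sequence[bool],
    joint_speeds: Sequence[float],
    speed_eps: float = 1e-2,
) -> List[int]:
    """Return timestep indices that look like important moments in an episode.

    Assumed heuristic: a step is a keyframe if the gripper state changes,
    the arm has just come to rest, or it is the last step of the episode.
    """
    keyframes = []
    last = len(gripper_open) - 1
    for t in range(1, len(gripper_open)):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        just_stopped = joint_speeds[t] < speed_eps <= joint_speeds[t - 1]
        if gripper_changed or just_stopped or t == last:
            keyframes.append(t)
    return keyframes


def uniform_samples(num_steps: int, interval: int) -> List[int]:
    """Baseline: pick every `interval`-th step regardless of what happens there."""
    return list(range(interval, num_steps, interval))


gripper_open = [True, True, True, False, False, False, True]
joint_speeds = [0.5, 0.4, 0.0, 0.0, 0.3, 0.0, 0.0]

print(find_keyframes(gripper_open, joint_speeds))  # [2, 3, 5, 6]
print(uniform_samples(len(gripper_open), 3))       # [3, 6]
```

In this toy episode, uniform sampling skips the steps where the arm arrives above the object and at the release point (2 and 5), while the heuristic keeps them; that difference is the intuition behind the ablation.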