A Pragmatist Robot: Learning to Plan Tasks by Experiencing the Real World

Kaixian Qu, Guowei Lan, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter

TL;DR

The paper addresses the mismatch between LLM-driven planning and real-world robotic embodiment by proposing PragmaBot, a framework that grounds planning in vision-language reasoning, short-term and long-term memories, and self-reflection. It uses a VLM for both perception and planning, relies on a short-term memory (STM) for online adaptation, stores lessons in a long-term memory (LTM), and employs retrieval-augmented generation (RAG) to plan with past experiences, supported by an on-demand image annotation module for grounded actions. Empirical results show substantial gains: STM-based self-reflection raises task success from 35% to 84% across four challenging tasks, and LTM with RAG raises single-trial success on 12 real-world scenarios from 22% to 80%, with RAG outperforming naive prompting. The findings demonstrate effective lifelong, embodied task planning without costly model fine-tuning, with practical implications for deploying adaptive, data-efficient robots in dynamic environments.

Abstract

Large language models (LLMs) have emerged as the dominant paradigm for robotic task planning using natural language instructions. However, trained on general internet data, LLMs are not inherently aligned with the embodiment, skill sets, and limitations of real-world robotic systems. Inspired by the emerging paradigm of verbal reinforcement learning, in which LLM agents improve through self-reflection and few-shot learning without parameter updates, we introduce PragmaBot, a framework that enables robots to learn task planning through real-world experience. PragmaBot employs a vision-language model (VLM) as the robot's "brain" and "eye", allowing it to visually evaluate action outcomes and self-reflect on failures. These reflections are stored in a short-term memory (STM), enabling the robot to quickly adapt its behavior during ongoing tasks. Upon task completion, the robot summarizes the lessons learned into its long-term memory (LTM). When facing new tasks, it can leverage retrieval-augmented generation (RAG) to plan more grounded action sequences by drawing on relevant past experiences and knowledge. Experiments on four challenging robotic tasks show that STM-based self-reflection increases task success rates from 35% to 84%, with emergent intelligent object interactions. In 12 real-world scenarios (including eight previously unseen tasks), the robot effectively learns from the LTM and improves single-trial success rates from 22% to 80%, with RAG outperforming naive prompting. These results highlight the effectiveness and generalizability of PragmaBot. Project webpage: https://pragmabot.github.io/
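
The loop the abstract describes (plan, execute, reflect into STM, summarize into LTM, retrieve for the next task) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`LongTermMemory`, `run_task`, the `toy_*` functions) is hypothetical, the VLM calls are replaced by deterministic stand-in functions, and retrieval uses simple word overlap where a real system would more likely use embeddings. Storing a lesson only after a successful trial is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Experience:
    task: str
    lesson: str


@dataclass
class LongTermMemory:
    """Hypothetical LTM: a list of (task, lesson) pairs with toy retrieval."""
    entries: List[Experience] = field(default_factory=list)

    def add(self, task: str, lesson: str) -> None:
        self.entries.append(Experience(task, lesson))

    def retrieve(self, task: str, k: int = 1) -> List[Experience]:
        # Assumption: rank stored tasks by Jaccard word overlap with the query.
        query = set(task.lower().split())

        def score(e: Experience) -> float:
            words = set(e.task.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]


def run_task(
    task: str,
    plan_fn: Callable[[str, List[Experience], List[str]], List[str]],
    execute_fn: Callable[[List[str]], Tuple[bool, str]],
    reflect_fn: Callable[[str, List[str], str], str],
    summarize_fn: Callable[[str, List[str], List[str]], str],
    ltm: LongTermMemory,
    max_trials: int = 3,
) -> Tuple[bool, int]:
    """Return (success, number of trials used)."""
    stm: List[str] = []              # reflections for the current task only
    retrieved = ltm.retrieve(task)   # RAG step: fetch relevant past lessons
    for trial in range(1, max_trials + 1):
        plan = plan_fn(task, retrieved, stm)
        success, observation = execute_fn(plan)
        if success:
            # On completion, condense the episode into one LTM entry.
            ltm.add(task, summarize_fn(task, plan, stm))
            return True, trial
        stm.append(reflect_fn(task, plan, observation))
    return False, max_trials


# Deterministic stand-ins for the VLM planner, evaluator, and reflector.
def toy_plan(task, retrieved, stm):
    hints = [e.lesson for e in retrieved] + stm
    steps = ["pick object", "place object in drawer"]
    if any("open the drawer" in h for h in hints):
        steps.insert(0, "open drawer")
    return steps


def toy_execute(plan):
    if "open drawer" not in plan:
        return False, "the drawer is closed"
    return True, "object is in the drawer"


def toy_reflect(task, plan, observation):
    return f"Failure: {observation}; open the drawer first"


def toy_summarize(task, plan, stm):
    return " ".join(stm) if stm else "Plan worked without corrections"


ltm = LongTermMemory()
first = run_task("put the cup in the drawer",
                 toy_plan, toy_execute, toy_reflect, toy_summarize, ltm)
second = run_task("put the spoon in the drawer",
                  toy_plan, toy_execute, toy_reflect, toy_summarize, ltm)
print(first, second)
```

In this toy setting the first task needs a second trial because the reflection only enters the plan through the STM, while the second, similar task succeeds on its first trial because the stored lesson is retrieved from the LTM before planning.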

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Robot completes a new task guided by a long-term memory of self-reflective experiences. When executing a novel task, the robot maintains a short-term memory that helps it reflect and learn how to complete the task (illustrated in the grey clip). The experience is then stored as long-term memory and retrieved to guide the VLM’s task planning whenever a similar scenario is encountered (illustrated in the main figure).
  • Figure 2: PragmaBot
  • Figure 3: Prompt templates used in different VLM modules.
  • Figure 4: Illustration of image annotation tools.
  • Figure 5: Overview of all experimental scenes (a) and demonstration of PragmaBot's performance across four representative scenarios (b). In the first two examples (top and middle rows), the robot successfully completes the tasks after self-reflection. These experiences are then summarized and stored in LTM, enabling the robot to generalize its learning to similar future scenarios (bottom row).
  • ...and 3 more figures