GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance

Arthur Bucker, Pablo Ortega-Kral, Jonathan Francis, Jean Oh

TL;DR

GRAPPA tackles the challenge of generalizing robot policies to unseen environments without extra demonstrations by introducing an online, agentic framework that grounds and refines pre-trained policies via visuomotor guidance. A multi-agent loop (Advisor, Grounding, Monitor, Robotic) generates task-specific guidance code that biases the base policy's action distribution, leveraging a dynamics model and visual grounding (e.g., SAM) to locate objects and plan improvements. The approach demonstrates improved sim-to-real transfer, robust performance in cluttered real-world settings, and the ability to learn new skills from scratch, while preserving low-level control details from the original policies. These findings suggest that online, grounded guidance can significantly enhance robustness and adaptability of robotic systems in real-world deployment with limited additional data.

Abstract

Robot learning approaches such as behavior cloning and reinforcement learning have shown great promise in synthesizing robot skills from human demonstrations in specific environments. However, these approaches often require task-specific demonstrations or designing complex simulation environments, which limits the development of generalizable and robust policies for unseen real-world settings. Recent advances in the use of foundation models for robotics (e.g., LLMs, VLMs) have shown great potential in enabling systems to understand the semantics in the world from large-scale internet data. However, it remains an open challenge to use this knowledge to enable robotic systems to understand the underlying dynamics of the world, to generalize policies across different tasks, and to adapt policies to new environments. To alleviate these limitations, we propose an agentic framework for robot self-guidance and self-improvement, which consists of a set of role-specialized conversational agents, such as a high-level advisor, a grounding agent, a monitoring agent, and a robotic agent. Our framework iteratively grounds a base robot policy to relevant objects in the environment and uses visuomotor cues to shift the action distribution of the policy to more desirable states, online, while remaining agnostic to the subjective configuration of a given robot hardware platform. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates, both in simulation and in real-world experiments, without the need for additional human demonstrations or extensive exploration. Code and videos available at: https://agenticrobots.github.io
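
The abstract describes shifting the base policy's action distribution toward more desirable states. The following is a minimal sketch of one way such a shift could work, not the paper's formulation. It assumes the base policy exposes a discrete set of candidate actions with probabilities, and that the generated guidance code reduces to a score per action. The names `guidance_score` and `guide`, the parameter `alpha`, and the exponential reweighting rule are all hypothetical choices made for illustration.

```python
import math

def guidance_score(action, target):
    """Hypothetical guidance function: higher when the action is nearer the target."""
    return -math.dist(action, target)

def guide(candidates, base_probs, target, alpha=1.0):
    """Reweight base-policy probabilities by exp(alpha * score), then renormalize.

    alpha = 0 leaves the base distribution unchanged; larger alpha trusts
    the guidance more. This rule is an assumption, not taken from the paper.
    """
    weights = [
        p * math.exp(alpha * guidance_score(a, target))
        for a, p in zip(candidates, base_probs)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: three candidate 2D end-effector positions from a base policy
# that prefers the wrong location; the grounded object sits at (1.0, 1.0).
candidates = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
base_probs = [0.6, 0.3, 0.1]
target = (1.0, 1.0)

guided = guide(candidates, base_probs, target, alpha=5.0)
best = candidates[max(range(len(guided)), key=guided.__getitem__)]
```

In this toy setup the guided distribution places most of its mass on the candidate at the target, even though the base policy ranked it last. The base probabilities still enter the product, so whatever preferences the original policy encodes are attenuated rather than discarded.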

Paper Structure

This paper contains 16 sections, 3 equations, 11 figures, 4 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Heatmap visualization of the guidance distribution, generated online by our proposed agentic framework, which produces code that biases a robot policy's action distribution towards desirable behavior.
  • Figure 2: An illustration of how GRAPPA intervenes in the action loop of pre-trained robotic policies in failure cases, providing visuomotor guidance generated by an agentic framework to shift the action distribution toward correct task execution.
  • Figure 3: Information flow between the agents to produce guidance code. a) The advisor agent orchestrates guidance code generation by collaborating with other agents and using their feedback to refine the generated code. b) The grounding agent uses segmentation and classification models to locate objects of interest provided by the advisor, reporting findings back to the advisor. c) The robotic agent uses a Python interpreter to test the code for the specific robotic platform and judge the adequacy of the code. d) The monitor agent analyses the sequence of frames corresponding to the rollout of the guidance and gives feedback on potential improvements.
  • Figure 4: Real-world results for learning skills from scratch on the UFactory xArm Lite6 chess task. The top row shows an external view of the robot performing the tasks. The second row depicts the action heat map given by the random diffuser policy at the first and last time step. The bottom row depicts the corresponding heat maps generated after applying the guidance, showing that it can successfully guide the action towards the desired object. On the right, we show a breakdown of the multi-granular search performed by the grounding agent to locate the white knight; we disambiguate the scene by searching in parent objects and constraining the search to semantically relevant areas.
  • Figure 5: GRAPPA guiding the base policy for out-of-distribution cases. The task involves grasping a deformable toy ball and placing it inside a box.
  • ...and 6 more figures