ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models

Runyu Ma, Jelle Luijkx, Zlatan Ajanovic, Jens Kober

TL;DR

ExploRLLM combines foundation-model-guided exploration with reinforcement learning (RL) to address the low sample efficiency of RL in robotic tabletop manipulation. It uses vision-language detections to reduce the observation space and LLM-generated policy code to propose exploratory actions, while a residual RL agent compensates for the limited physical understanding of foundation models (FMs). In tabletop manipulation experiments, ExploRLLM converges faster and reaches higher success rates than FM-only and RL baselines, and real-world experiments show promising zero-shot sim-to-real transfer. The approach generalizes to unseen colors and letters, reducing reliance on extensive real-world data.

Abstract

In robot manipulation, Reinforcement Learning (RL) often suffers from low sample efficiency and uncertain convergence, especially in large observation and action spaces. Foundation Models (FMs) offer an alternative, demonstrating promise in zero-shot and few-shot settings. However, they can be unreliable due to limited physical and spatial understanding. We introduce ExploRLLM, a method that combines the strengths of both paradigms. In our approach, FMs improve RL convergence by generating policy code and efficient representations, while a residual RL agent compensates for the FMs' limited physical understanding. We show that ExploRLLM outperforms both policies derived from FMs and RL baselines in table-top manipulation tasks. Additionally, real-world experiments show that the policies exhibit promising zero-shot sim-to-real transfer. Supplementary material is available at https://explorllm.github.io.
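
The abstract does not spell out how FM-derived policy code and the RL agent are combined during training. One reading, consistent with the exploration rate $\epsilon$ mentioned in the Figure 4 caption, is an epsilon-style switch: with probability $\epsilon$ the agent executes the action proposed by the LLM-generated exploration policy, and otherwise it follows its own RL policy. The sketch below illustrates that reading only. All names (`select_action`, `residual_action`, `rl_policy`, `llm_exploration_policy`, `scale`) are hypothetical and not taken from the paper, and the additive form of the residual correction is an assumption.

```python
import random
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]
Policy = Callable[[Observation], Sequence[float]]


def select_action(
    obs: Observation,
    rl_policy: Policy,
    llm_exploration_policy: Policy,
    epsilon: float,
    rng: random.Random,
) -> Tuple[List[float], bool]:
    """Pick an action for one training step.

    Assumed scheme: with probability `epsilon`, use the action proposed by
    the LLM-generated exploration policy; otherwise use the RL policy.
    Returns the action and a flag telling whether it came from the LLM policy.
    """
    if rng.random() < epsilon:
        return list(llm_exploration_policy(obs)), True
    return list(rl_policy(obs)), False


def residual_action(
    base_action: Sequence[float],
    residual: Sequence[float],
    scale: float = 1.0,
) -> List[float]:
    """Add a learned correction to a base action (assumed additive form)."""
    return [b + scale * r for b, r in zip(base_action, residual)]
```

Under this reading, setting `epsilon=0` never consults the LLM-generated policy, which corresponds to the "RL without LLM-based exploration" baseline named in the Figure 4 caption.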

Paper Structure

This paper contains 20 sections, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Graphical overview of ExploRLLM.
  • Figure 2: Implementation structure of ExploRLLM for tabletop manipulation, combining the strengths of RL and FMs.
  • Figure 3: Based on an exploration prompt, candidate policy code is generated. The exploration policy is selected after evaluation.
  • Figure 4: Training curves for varying exploration rates in short-horizon (SH) and long-horizon (LH) tasks. ExploRLLM outperforms the exploration policies (dashed lines) and RL without LLM-based exploration ($\epsilon=0$). In the LH task, LLM-based exploration is crucial for success.
  • Figure 5: Short-horizon ExploRLLM policies can be used in long-horizon tasks with zero-shot LLM planners, e.g., Socratic Models (Zeng et al., 2022).
  • ...and 1 more figure