Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

Murtaza Dalal, Tarun Chiruvolu, Devendra Chaplot, Ruslan Salakhutdinov

TL;DR

The paper tackles long-horizon robotic tasks without relying on a predefined skill library by introducing Plan-Seq-Learn (PSL), a modular framework that connects high-level language planning to low-level control. PSL uses an LLM (Plan) to decompose a task into a list of target regions and stage termination conditions, a vision-guided sequencing module (Seq) that computes a target robot pose for each region and reaches it with motion planning, and an RL module (Learn) that acquires local control policies shared across task stages. It reports state-of-the-art performance on more than 25 long-horizon tasks spanning four benchmarks, with success rates above 85% from raw visual input and robustness to pose estimation noise and plan imperfections. The results suggest PSL can reduce engineering effort by drawing on the internet-scale knowledge of LLMs while remaining sample-efficient and adaptable for complex robotic manipulation.
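The three-module loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `Stage`, `ToyEnv`, `plan`, `sequence`, `local_policy`, and `run_psl` are hypothetical, the LLM is replaced by a hard-coded plan, pose estimation and motion planning by a lookup and a direct position update, and the RL policy by a fixed action. It only shows how the stages might hand off to one another.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical 1-D positions of the regions named in the plan.
REGION_POSITIONS = {"nut_1": 0.2, "peg_1": 0.5, "nut_2": 0.3, "peg_2": 0.8}


@dataclass
class Stage:
    region: str       # target region chosen by the high-level plan
    termination: str  # name of the stage termination condition


class ToyEnv:
    """Toy stand-in for a manipulation environment (assumption, not a real simulator)."""

    def __init__(self) -> None:
        self.robot = 0.0
        self.completed: List[Tuple[str, str]] = []

    def move_to(self, x: float) -> None:
        self.robot = x

    def step(self, stage: Stage, action: str) -> bool:
        # The stage terminates only if the robot is near the region and interacts.
        near = abs(self.robot - REGION_POSITIONS[stage.region]) < 0.05
        if near and action == "interact":
            self.completed.append((stage.region, stage.termination))
            return True
        return False


def plan(task_description: str) -> List[Stage]:
    """Plan: stand-in for the LLM; returns a fixed plan for a two-nut toy task."""
    return [
        Stage("nut_1", "grasp"),
        Stage("peg_1", "place"),
        Stage("nut_2", "grasp"),
        Stage("peg_2", "place"),
    ]


def sequence(env: ToyEnv, stage: Stage) -> None:
    """Seq: stand-in for pose estimation (lookup) plus motion planning (direct move)."""
    env.move_to(REGION_POSITIONS[stage.region])


def local_policy(stage: Stage) -> str:
    """Learn: stand-in for a single RL policy shared across all stages."""
    return "interact"


def run_psl(task: str, env: ToyEnv, max_steps_per_stage: int = 10) -> bool:
    for stage in plan(task):
        sequence(env, stage)  # start local control near the region of interest
        for _ in range(max_steps_per_stage):
            if env.step(stage, local_policy(stage)):
                break
        else:
            return False  # stage did not terminate within the step budget
    return True


if __name__ == "__main__":
    env = ToyEnv()
    print(run_psl("put both nuts on their pegs", env), env.completed)
```

In this sketch the local policy only has to act once the robot is already at the target region, which mirrors the summary's point that motion planning handles the transit between regions and learning handles the interaction.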

Abstract

Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, out-performing language-based, classical, and end-to-end approaches. Video results and code at https://mihdalal.github.io/planseqlearn/

Paper Structure (39 sections, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: Long horizon task visualization. We visualize PSL solving the NutAssembly task, in which the goal is to put both nuts on their respective pegs. After predicting the high-level plan using an LLM, PSL computes a target robot pose, achieves it using motion planning and then learns interaction via RL (third row).
  • Figure 2: Method overview. PSL decomposes tasks into a list of regions and stage termination conditions using an LLM (top), sequences the plan using motion planning (left) and learns control policies using RL (right).
  • Figure 3: Sample Efficiency Results. We plot task success rate as a function of the number of trials. PSL improves on the sample efficiency of the baselines across each task in Robosuite, Kitchen, Meta-World, and Obstructed Suite. PSL is able to do so because it initializes the RL policy near the region of interest (as predicted by the Plan and Sequence Modules) and leverages local observations to efficiently learn interaction. Additional learning curves are in the appendix on additional experiments.
  • Figure (unnumbered): Plan-Seq-Learn Overview
  • Figure C.1: Camera View Learning Performance Ablation. Wrist camera views clearly accelerate learning, converging to near 100% performance 4x faster than fixed-view observations and 3x faster than wrist+fixed-view observations.
  • ...and 6 more figures