Latent Plans for Task-Agnostic Offline Reinforcement Learning

Erick Rosete-Beas; Oier Mees; Gabriel Kalweit; Joschka Boedecker; Wolfram Burgard

TL;DR

The paper presents TACO-RL, a hierarchical framework that leverages imitation-learned latent skills for a low-level policy and offline RL for a high-level policy to solve long-horizon, goal-conditioned tasks from unstructured offline play data. By auto-encoding latent plans and applying hindsight relabeling, the method stitches short-horizon behaviors into temporally extended strategies, achieving significant improvements over state-of-the-art baselines in both simulation and real-world robotics. The approach yields a single visuomotor policy capable of 25 real-world tasks and demonstrates robust long-horizon reasoning, highlighting a practical path toward scalable, task-agnostic robot control from unlabeled data.

Abstract

Everyday long-horizon tasks that comprise a sequence of multiple implicit subtasks still pose a major challenge in offline robot control. While a number of prior methods aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As both paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both methods to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors. Experiments in various simulated and real robot control tasks show that our formulation enables producing previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining, with an order-of-magnitude improvement in performance over state-of-the-art baselines. We even learn one multi-task visuomotor policy for 25 distinct manipulation tasks in the real world, which outperforms both imitation learning and offline reinforcement learning techniques.
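The two-level control loop described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `high_level_policy`, `low_level_policy`, `rollout`, the `replan_every` interval, and the latent plan dimension are hypothetical names and values, and both policies are random or trivial stand-ins for trained networks.

```python
import numpy as np

def high_level_policy(obs, goal, rng, plan_dim=4):
    # Hypothetical stand-in: a trained high-level policy would map
    # (observation, goal) to a latent plan. Here we just sample one.
    return rng.standard_normal(plan_dim)

def low_level_policy(obs, plan):
    # Hypothetical stand-in: a trained low-level policy would decode an
    # action from the observation and the current latent plan.
    # np.resize repeats the plan entries to fill a 7-dimensional action.
    return np.tanh(np.resize(plan, 7))

def rollout(env_step, obs, goal, horizon=12, replan_every=4, seed=0):
    """Run a goal-conditioned hierarchical rollout.

    The high-level policy picks a new latent plan every `replan_every`
    steps (an assumed fixed interval); the low-level policy acts at
    every step, conditioned on the current plan.
    """
    rng = np.random.default_rng(seed)
    actions, n_plans, plan = [], 0, None
    for t in range(horizon):
        if t % replan_every == 0:
            plan = high_level_policy(obs, goal, rng)
            n_plans += 1
        action = low_level_policy(obs, plan)
        obs = env_step(obs, action)  # environment transition supplied by caller
        actions.append(action)
    return np.stack(actions), n_plans

# Toy usage with a made-up environment transition.
toy_step = lambda obs, action: obs + 0.1 * action.sum()
acts, plans_used = rollout(toy_step, np.zeros(3), np.ones(3))
```

With these assumed settings the high-level policy makes 3 decisions over a 12-step episode, which is the sense in which the hierarchy shortens the effective horizon seen by the high-level policy.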

Paper Structure (38 sections, 4 equations, 9 figures, 5 tables)

Introduction
Related Work
Mathematical Foundation
Offline goal-conditioned RL with TACO-RL
Learning the low-level policy
Offline RL with Hindsight relabeling
Experimental Results
Experimental Setup
Simulation Results
Real-Robot Experiments
Conclusion and Limitations
Teleoperation Interface
Simulation
Real World
Experimental Setup Details
...and 23 more sections

Figures (9)

Figure 1: TACO-RL learns a single 7-DoF hierarchical visuomotor policy from offline data. It can solve long-horizon robot manipulation tasks by using a high-level policy that divides a task into a sequence of latent behaviors that are executed by a low-level policy that interacts with the environment. It reduces the effective horizon of the high-level policy and learns to chain skills through dynamic programming.
Figure 2: TACO-RL Overview. TACO-RL is a self-supervised general-purpose model learned from an offline dataset of robot interactions that generalizes to a wide variety of long-horizon manipulation tasks. (1) Low-level policy: Recognizes and organizes a repertoire of behaviors from an unlabeled, undirected dataset in a latent plan space. (2) High-level policy: Hindsight relabeling of sampled windows of experience into reward-augmented latent plan transitions. Learned with offline RL, this allows the high-level policy to stitch plans together to achieve complex long-horizon tasks. (3) Inference: the hierarchical model is used to perform goal-conditioned rollouts in robot manipulation tasks.
Figure 3: We relabel sampled trajectories into reward augmented transitions by sampling goal states that can be reached after executing a sequence of behaviors. With green border, we have the frame found at the end of the sampled trajectory. As this state will be reached after executing the latent behavior, the reward for this transition is 1. With blue border, we find future states that occur after the sampled sequence. These goals are necessary for chaining behaviors. The reward for these transitions is 0. Finally, with red border, we present images with similar proprioceptive information to the final state in the sampled trajectory, but a different scene arrangement. The reward for these transitions is 0.
Figure 4: Real-world Manipulation Tasks. Examples shown from left to right are: closing the drawer, opening the drawer, moving the sliding door left, moving the sliding door right, lifting the block, rotating the block, pushing the block, turning the green LED on, placing the block on top of the drawer and placing the block in the container.
Figure 5: Visualization of the real world data collection procedure (left) and the full robot setup (right).
...and 4 more figures
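
The relabeling scheme in the Figure 3 caption can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's code: frames are represented by integer indices instead of images, `Transition` and `relabel_window` are hypothetical names, the latent plan is a placeholder value rather than an encoder output, and the negative goals are assumed to be supplied by the caller (the caption describes them as frames with similar proprioception but a different scene arrangement).

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence

@dataclass
class Transition:
    state: int        # index of the window's first frame (stand-in for an image)
    latent_plan: int  # placeholder for the encoded latent plan of the window
    next_state: int   # index of the window's last frame
    goal: int         # index of the relabeled goal frame
    reward: float

def relabel_window(start: int, window: int, episode_len: int,
                   negatives: Sequence[int], n_future: int = 2,
                   rng: Optional[random.Random] = None) -> List[Transition]:
    """Turn one sampled window into reward-augmented transitions."""
    rng = rng or random.Random(0)
    end = start + window
    assert end < episode_len, "window must fit inside the episode"
    plan = start  # hypothetical: a real system would encode the window here

    # Goal = final frame of the window: reached by the plan, so reward 1.
    out = [Transition(start, plan, end, end, 1.0)]

    # Goals = later frames of the same episode: needed for chaining, reward 0.
    future = list(range(end + 1, episode_len))
    for g in rng.sample(future, min(n_future, len(future))):
        out.append(Transition(start, plan, end, g, 0.0))

    # Goals = caller-supplied negative frames: reward 0.
    for g in negatives:
        out.append(Transition(start, plan, end, g, 0.0))
    return out

# Toy usage: a 10-frame episode, a 4-step window, one negative frame.
transitions = relabel_window(start=0, window=4, episode_len=10, negatives=[100])
```

Each window thus yields one positive transition and several zero-reward transitions that share the same state and latent plan but differ in the goal.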
