TLDR: Unsupervised Goal-Conditioned RL via Temporal Distance-Aware Representations

Junik Bae; Kwanyoung Park; Youngwoon Lee

TL;DR

This work tackles unsupervised goal-conditioned reinforcement learning by addressing limited state coverage and long-horizon goal-reaching challenges. It introduces TLDR, a framework that learns temporal distance-aware representations to guide exploratory goal selection, intrinsic rewards, and the goal-conditioned policy within a Go-Explore-inspired setup. Empirical results across six state-based and two pixel-based locomotion tasks show that TLDR achieves substantially broader state coverage and robust goal-reaching, with ablations confirming the value of temporal-distance signals for both exploration and learning. Limitations include slower learning in pixel-based environments and potential safety considerations for real robots, suggesting avenues for future work in representation learning, model-based enhancements, and safety-aware deployment.

Abstract

Unsupervised goal-conditioned reinforcement learning (GCRL) is a promising paradigm for developing diverse robotic skills without external supervision. However, existing unsupervised GCRL methods often struggle to cover a wide range of states in complex environments due to their limited exploration and sparse or noisy rewards for GCRL. To overcome these challenges, we propose a novel unsupervised GCRL method that leverages TemporaL Distance-aware Representations (TLDR). Based on temporal distance, TLDR selects faraway goals to initiate exploration and computes intrinsic exploration rewards and goal-reaching rewards. Specifically, our exploration policy seeks states with large temporal distances (i.e., covering a large state space), while the goal-conditioned policy learns to minimize the temporal distance to the goal (i.e., reaching the goal). Our results in six simulated locomotion environments demonstrate that TLDR significantly outperforms prior unsupervised GCRL methods in achieving a wide range of states.
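
As a rough illustration of the two reward signals described in the abstract, the sketch below assumes that the temporal distance between two states can be approximated by the Euclidean distance between their learned representations. That approximation, the function names, and the nearest-neighbor form of the exploration reward are assumptions made for illustration; the abstract does not give the exact formulas.

```python
import math

def temporal_distance(z_a, z_b):
    """Assumed proxy: Euclidean distance between two representation vectors."""
    return math.dist(z_a, z_b)

def goal_reaching_reward(z_state, z_next, z_goal):
    """Reward the decrease in (assumed) temporal distance to the goal after one step."""
    return temporal_distance(z_state, z_goal) - temporal_distance(z_next, z_goal)

def exploration_reward(z_next, visited):
    """Reward states far from everything visited so far.

    Hypothetical form: distance to the nearest visited representation.
    """
    return min(temporal_distance(z_next, z) for z in visited)

# Moving from (0, 0) to (3, 0) with the goal at (3, 4) shrinks the distance from 5 to 4.
step_reward = goal_reaching_reward((0.0, 0.0), (3.0, 0.0), (3.0, 4.0))
novelty = exploration_reward((0.0, 2.0), [(0.0, 0.0), (0.0, 5.0)])
```

Under these assumptions, a goal-conditioned policy maximizing `goal_reaching_reward` is pushed toward the goal, while an exploration policy maximizing `exploration_reward` is pushed away from states it has already visited.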

Paper Structure (40 sections, 6 equations, 22 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Approach
Problem Formulation
Learning Temporal Distance-Aware Representations
Unsupervised GCRL with Temporal Distance-Aware Representations
Exploratory Goal Selection
Learning Exploration Policy
Learning Goal-Conditioned Policy
Experiments
Experimental Setup
Tasks.
Comparisons.
Quantitative Results
Qualitative Results
...and 25 more sections

Figures (22)

Figure 1: Trajectories (red) of an ant robot in a complex maze trained by TLDR, METRA [park2024metra], and PEG [hu2022planning]. While prior methods yield limited exploration, TLDR explores the entire maze.
Figure 2: Overview of TLDR algorithm. TLDR leverages temporal distance-aware representations for unsupervised GCRL. (a) We start by learning a state encoder $\phi(\mathbf{s})$ that maps states to temporal distance-aware representations. With the temporal distance-aware representations, TLDR (b) selects the temporally farthest state from the visited states as an exploratory goal, (c) reaches the chosen goal using a goal-conditioned policy, which learns to minimize temporal distance to the goal, and (d) collects exploratory trajectories using an exploration policy that visits states with large temporal distance from the visited states.
Figure 3: We evaluate our method on $6$ state-based robotic locomotion environments.
Figure 4: State coverage in state-based environments. We measure the state coverage of unsupervised exploration methods. Our method consistently shows superior state coverage compared to other methods, except against METRA in HalfCheetah.
Figure 5: Goal-reaching metrics of a goal-conditioned policy. For (a) Ant, (b) HalfCheetah, and (c) Humanoid-Run, we report the average distance between goals and the last states of trajectories (lower is better). TLDR achieves a comparable average goal distance to METRA. For AntMaze environments, we report the number of pre-defined goals reached by a goal-reaching policy ($7$ for (d) AntMaze-Large and $21$ for (e) AntMaze-Ultra), and TLDR significantly outperforms prior works.
...and 17 more figures
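
The exploratory goal selection step (b) in the Figure 2 caption can be sketched as follows. The scoring rule here is an assumption: each visited state is scored by the Euclidean distance from its representation to its k-th nearest other visited representation, and the highest-scoring state becomes the goal. The function name and the parameter `k` are hypothetical and are not taken from the paper.

```python
import math

def select_exploratory_goal(visited, k=1):
    """Return the index of the most isolated visited representation.

    Assumption: "temporally farthest" is scored as the distance to the
    k-th nearest other visited representation (a hypothetical rule).
    """
    if len(visited) <= k:
        raise ValueError("need more than k visited states")
    best_index, best_score = 0, -1.0
    for i, z in enumerate(visited):
        # Sorted distances from this representation to every other one.
        dists = sorted(
            math.dist(z, other) for j, other in enumerate(visited) if j != i
        )
        score = dists[k - 1]
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Three clustered representations and one outlier; the outlier is selected.
visited = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
goal_index = select_exploratory_goal(visited)
```

This brute-force version compares every pair of representations, so it is only practical for small buffers; it is meant to show the selection idea, not an efficient implementation.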
