Backward Learning for Goal-Conditioned Policies

Marc Höftmann; Jan Robine; Stefan Harmeling

TL;DR

The paper tackles reward-free reinforcement learning by introducing backward learning for goal-conditioned policies. A backward world model predicts previous states and generates backward trajectories from a goal state $s_g$. These trajectories are refined by a Shortest Path Estimator (SPE) operating on a directed graph of observed transitions and are then used for imitation learning. The approach supports multiple goals and negative goals, enabling data-efficient policy learning without extrinsic rewards. Evaluated on a deterministic maze with $64\times 64$ pixel observations, the method consistently reaches multiple goals and shows improved generalization through clockwise multi-goal strategies. The result is a model-based framework for goal-directed control in reward-free settings and a blueprint for backward planning in RL.

Abstract

Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer both questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, second generates goal-reaching backward trajectories, third improves those sequences using shortest-path algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment whose observations are $64\times 64$ pixel bird's-eye images and show that it consistently reaches several goals.
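The four steps in the abstract can be illustrated with a minimal sketch on a toy maze. Everything below is an assumption made for illustration and not the authors' implementation: the 3x3 grid, the `step` function, the lookup-table `backward_model` (standing in for a learned neural backward world model over images), breadth-first search as the shortest-path algorithm, and all function names.

```python
import random
from collections import deque

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE, WALLS, GOAL = 3, {(1, 1)}, (2, 2)  # hypothetical toy maze
FREE = [(r, c) for r in range(SIZE) for c in range(SIZE) if (r, c) not in WALLS]

def step(state, action):
    """Deterministic toy dynamics: move one cell unless a wall or border blocks it."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if nxt in FREE else state

# Step 1: backward model. A table built from observed transitions stands in for
# the learned model: it maps a state to its (previous state, action) pairs.
backward_model = {}
for s in FREE:
    for a in ACTIONS:
        s_next = step(s, a)
        if s_next != s:
            backward_model.setdefault(s_next, []).append((s, a))

# Step 2: roll the backward model out from the goal to collect transitions
# that are known to lead toward the goal.
def backward_rollout(goal, horizon, rng):
    state, transitions = goal, []
    for _ in range(horizon):
        prev, action = rng.choice(backward_model[state])
        transitions.append((prev, action, state))
        state = prev
    return transitions

rng = random.Random(0)
edges = set()
for _ in range(50):
    edges.update(backward_rollout(GOAL, horizon=8, rng=rng))

# Step 3: improve the sampled sequences with a shortest-path search (here BFS)
# over the directed graph of collected transitions, walking backward from the goal.
def shortest_path_labels(edges, goal):
    incoming = {}
    for prev, action, state in sorted(edges):
        incoming.setdefault(state, []).append((prev, action))
    dist, labels, queue = {goal: 0}, {}, deque([goal])
    while queue:
        state = queue.popleft()
        for prev, action in incoming.get(state, []):
            if prev not in dist:
                dist[prev] = dist[state] + 1
                labels[prev] = action  # first action on a shortest path to the goal
                queue.append(prev)
    return dist, labels

dist, policy = shortest_path_labels(edges, GOAL)

# Step 4: the ((state, goal), action) pairs are the supervised targets an
# imitation learner would fit; this sketch just uses the label table as the policy.
dataset = [((s, GOAL), a) for s, a in sorted(policy.items())]

def run_policy(start):
    """Follow the labeled actions from `start` and count steps until the goal."""
    state, steps = start, 0
    while state != GOAL:
        state = step(state, policy[state])
        steps += 1
    return steps
```

In this sketch, every state that appears in `policy` reaches `GOAL` in exactly `dist[state]` steps when `run_policy` is called on it, because each labeled action follows an observed transition that moves one step closer to the goal. In the paper's setting the table would be replaced by a neural network trained on image observations.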

Paper Structure (21 sections, 4 equations, 1 figure, 1 table)

This paper contains 21 sections, 4 equations, 1 figure, 1 table.

Introduction
Prior Work
Method
Experiments
Submission of conference papers to ICLR 2024
Style
Retrieval of style files
General formatting instructions
Headings: first level
Headings: second level
Headings: third level
Citations, figures, tables, references
Citations within the text
Footnotes
Figures
...and 6 more sections

Figures (1)

Figure 1: Sample figure caption.
