Backward Learning for Goal-Conditioned Policies
Marc Höftmann, Jan Robine, Stefan Harmeling
TL;DR
The paper addresses reward-free reinforcement learning by introducing backward learning for goal-conditioned policies. A backward world model predicts previous states and is used to generate backward trajectories starting from a goal state $s_g$. These trajectories are refined by a Shortest Path Estimator (SPE) operating on a directed graph of observed transitions, and the refined paths serve as training data for imitation learning. The approach supports multiple goals as well as negative goals, so a policy can be learned without extrinsic rewards. On a deterministic maze with $64\times 64$ pixel observations, the method consistently reaches multiple goals and shows improved generalization through clockwise multi-goal strategies. The result is a model-based framework for goal-directed control in reward-free settings, built on backward planning.
Abstract
Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, then generates goal-reaching backward trajectories, next improves those sequences using shortest-path-finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$ pixel bird's-eye images and show that it consistently reaches several goals.
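The shortest-path step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that states are hashable identifiers (for image observations one would first need some discretization or hashing, which is not specified here), that transitions are stored as `(state, action, next_state)` triples, and that every step has unit cost. The function name `shortest_path_labels` and the toy transitions are hypothetical. A breadth-first search runs backward from the goal over the reversed transition graph and records, for every state that can reach the goal, its distance and the first action of one shortest path. These state-action pairs are the kind of labels an imitation learner could be trained on.

```python
from collections import defaultdict, deque


def shortest_path_labels(transitions, goal):
    """Backward BFS from `goal` over observed (state, action, next_state) triples.

    Returns (dist, best_action): the number of steps to the goal and the first
    action of one shortest path, for every state that can reach the goal.
    """
    # Reverse adjacency: for each state, which (state, action) pairs lead into it.
    predecessors = defaultdict(list)
    for state, action, next_state in transitions:
        predecessors[next_state].append((state, action))

    dist = {goal: 0}
    best_action = {}
    queue = deque([goal])
    while queue:
        current = queue.popleft()
        for prev, action in predecessors[current]:
            if prev not in dist:  # first visit in BFS order is a shortest path
                dist[prev] = dist[current] + 1
                best_action[prev] = action
                queue.append(prev)
    return dist, best_action


# Hypothetical toy graph: two routes from A to the goal G, and a state E
# that is reachable from G but cannot reach it.
toy_transitions = [
    ("A", "right", "B"),
    ("B", "down", "G"),
    ("A", "down", "C"),
    ("C", "right", "D"),
    ("D", "up", "G"),
    ("G", "left", "E"),
]
dist, best_action = shortest_path_labels(toy_transitions, "G")
```

In this toy graph, `A` is labeled with the action `right` because the route through `B` takes two steps, whereas the route through `C` and `D` takes three. State `E` receives no label, since no observed transition leads from it toward the goal.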
