Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

TL;DR

This work benchmarks reinforcement learning and optimal-control paradigms on reward-free offline data for navigation tasks, and introduces Planning with a Latent Dynamics Model (PLDM), a latent-space planning approach built on JEPA. The study finds that model-free RL benefits most from large amounts of high-quality data, while PLDM is more data-efficient and generalizes better to unseen layouts and tasks, including zero-shot adaptation to new tasks. Key findings include PLDM's robustness to suboptimal data, strong trajectory stitching in higher-dimensional control, and better generalization to new environments than the compared methods; these results are supported by extensive ablations across diverse environments. The work advocates latent-dynamics planning as a promising direction for building general autonomous agents from reward-free offline data.

Abstract

A long-standing goal in AI is to develop agents capable of solving diverse tasks across a range of environments, including those never seen during training. Two dominant paradigms address this challenge: (i) reinforcement learning (RL), which learns policies via trial and error, and (ii) optimal control, which plans actions using a known or learned dynamics model. However, their comparative strengths in the offline setting - where agents must learn from reward-free trajectories - remain underexplored. In this work, we systematically evaluate RL and control-based methods on a suite of navigation tasks, using offline datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot methods. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and employ it for planning. We investigate how factors such as data diversity, trajectory quality, and environment variability influence the performance of these approaches. Our results show that model-free RL benefits most from large amounts of high-quality data, whereas model-based planning generalizes better to unseen layouts and is more data-efficient, while achieving trajectory stitching performance comparable to leading model-free methods. Notably, planning with a latent dynamics model proves to be a strong approach for handling suboptimal offline data and adapting to diverse environments.
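The control-side idea in the abstract, planning with a learned latent dynamics model, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `encode`, `predict`, and `plan_random_shooting` are hypothetical, the toy encoder and dynamics stand in for trained JEPA components, and random shooting is a simple stand-in planner, not necessarily the one used in the paper.

```python
import numpy as np


def plan_random_shooting(encode, predict, obs, goal_obs, horizon=10,
                         num_samples=256, action_dim=2, rng=None):
    """Pick the first action of the sampled action sequence whose predicted
    final latent state lands closest to the goal's latent state.

    `encode` and `predict` are placeholders for a learned encoder and a
    learned latent dynamics model (hypothetical interfaces).
    """
    rng = np.random.default_rng() if rng is None else rng
    z_start = encode(obs)
    z_goal = encode(goal_obs)

    # Candidate action sequences, shape (num_samples, horizon, action_dim).
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))

    # Roll every candidate forward in latent space; no rewards are needed.
    z = np.repeat(z_start[None, :], num_samples, axis=0)
    for t in range(horizon):
        z = predict(z, actions[:, t])

    # Cost: distance between the predicted final latent and the goal latent.
    costs = np.linalg.norm(z - z_goal[None, :], axis=1)
    best = int(np.argmin(costs))
    return actions[best, 0], float(costs[best])


# Toy stand-ins (assumptions): identity encoder and additive dynamics.
def toy_encode(obs):
    return np.asarray(obs, dtype=float)


def toy_predict(z, a):
    return z + 0.1 * a


if __name__ == "__main__":
    action, cost = plan_random_shooting(
        toy_encode, toy_predict, obs=[0.0, 0.0], goal_obs=[0.5, 0.5],
        rng=np.random.default_rng(0),
    )
    print("first action:", action, "predicted final distance:", cost)
```

In a receding-horizon loop, only the returned first action would be executed before replanning from the next observation. Because the goal enters only through the planning cost, the same model can in principle be reused for a different objective by changing that cost.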

Paper Structure

This paper contains 60 sections, 6 equations, 12 figures, 18 tables.

Figures (12)

  • Figure 1: Overview of our analysis. We test six methods for learning from offline reward-free trajectories on 23 different datasets across several navigation environments. We evaluate six generalization properties required to scale to large offline datasets of suboptimal trajectories. We find that planning with a latent dynamics model (PLDM) demonstrates the highest level of generalization. For a full comparison, see \ref{tab:method_comparison_transposed}. Right: diagram of PLDM. Circles represent variables, rectangles represent loss components, and half-ovals represent trained models.
  • Figure 2: Left: We train offline goal-conditioned agents on trajectories collected in a subset of maze layouts (left), and evaluate on held-out layouts, observing the trajectories shown on the right. Only PLDM solves the task (see \ref{fig:sample_trajs} for more). Right: Success rates of tested methods on held-out layouts, as a function of the number of training layouts. The rightmost plot shows success rates of models trained on data from five layouts, evaluated on held-out layouts ranging from those similar to training layouts to out-of-distribution ones. We use map layout edit distance from the training layouts as a measure of distribution shift. PLDM demonstrates the best generalization performance. Results are averaged over 3 seeds; the shaded area denotes standard error. See \ref{fig:main_idea} for more details on PLDM.
  • Figure 3: Left: The Two-Rooms environment. The agent starts at a random location and is tasked with reaching the goal at another randomly sampled location in the other room using 200 steps or fewer. Observations are $64 \times 64$-pixel images. Right: Examples of trajectories in the offline data. Red: each step's direction is sampled from a von Mises distribution. Blue: each step's direction is sampled uniformly.
  • Figure 4: Testing the selected methods' performance under different dataset constraints. Values and shaded regions are means and standard errors over 3 seeds, respectively. Left: To test the importance of dataset quality, we mix random-policy trajectories with good-quality trajectories (see \ref{fig:env_and_traj}). As the amount of good-quality data goes to 0, methods begin to fail, with PLDM, GCIQL, and HILP being the most robust. Center: We measure methods' performance when trained with different sequence lengths. We find that many goal-conditioned methods fail when training trajectories are short, which causes far-away goals to become out-of-distribution for the resulting policy. Right: We measure methods' performance with datasets of varying sizes. We see that PLDM and GCIQL are the most sample-efficient, and manage to reach an almost 80% success rate even with a few thousand transitions. See \ref{sec:pvalues} for the analysis of statistical significance.
  • Figure 5: Zero-shot generalization to the chasing task. (a) In the chase environment, the blue agent is tasked with avoiding the red chaser. The chaser follows the shortest path to the agent. The observations of the agent remain unchanged: we pass the chaser state as the goal state. The agent has to avoid the specified state instead of reaching it. (b) Left: Performance of the tested methods on the chasing task across different chaser speeds, with a faster chaser making the task harder. Baselines include agents that take no action ('Zero') and random actions ('Random'). (b) Right: Average distance between the agent and the chaser throughout the episode when the chaser speed is $1.0$. (c) Visualization of the ant-umaze environment. The 4-legged ant is tasked with reaching a randomly sampled goal within a U-shaped room.
  • ...and 7 more figures
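
The Figure 3 caption distinguishes two kinds of offline trajectories by how each step's direction is sampled. The sketch below illustrates one plausible reading; it is not the paper's data-collection code. It assumes the von Mises distribution is centered on the previous heading, so that steps drift smoothly, and the function name `sample_trajectory` and the parameters `kappa` and `step_size` are hypothetical.

```python
import numpy as np


def sample_trajectory(num_steps, step_size=1.0, kappa=None, rng=None):
    """Sample 2D positions of a fixed-step-length walk starting at the origin.

    kappa=None : each step's direction is uniform on [-pi, pi).
    kappa>0    : each change of heading is von Mises distributed around 0,
                 i.e. the new direction is centered on the previous one
                 (an assumption; the caption does not specify the mean).
    """
    rng = np.random.default_rng() if rng is None else rng
    if kappa is None:
        angles = rng.uniform(-np.pi, np.pi, size=num_steps)
    else:
        start = rng.uniform(-np.pi, np.pi)
        turns = rng.vonmises(0.0, kappa, size=num_steps)
        angles = start + np.cumsum(turns)

    # Unit direction vectors scaled by the step size, then accumulated.
    steps = step_size * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    smooth = sample_trajectory(100, kappa=5.0, rng=rng)   # "red"-style walk
    jittery = sample_trajectory(100, kappa=None, rng=rng)  # "blue"-style walk
    print("smooth end point:", smooth[-1])
    print("uniform end point:", jittery[-1])
```

Under this reading, a larger `kappa` concentrates heading changes near zero and produces straighter paths, while uniform sampling produces a random walk that tends to stay near its start.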