Planning as Descent: Goal-Conditioned Latent Trajectory Synthesis in Learned Energy Landscapes

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

TL;DR

PaD reframes offline goal-conditioned planning as gradient-based refinement in a learned energy landscape over latent trajectories, enabling verification-driven trajectory synthesis without explicit policy or value learning. By training the energy model with hindsight goal relabeling and enforcing training–inference alignment through shared refinement dynamics, PaD produces coherent, goal-directed plans even from reward-free, suboptimal data. At inference, multiple time-to-reach hypotheses are refined in parallel, and low-energy plans are selected to balance feasibility and efficiency; action decoding is handled separately via inverse dynamics. Empirically, PaD achieves state-of-the-art performance on OGBench cube tasks, and surprisingly, training on diverse but suboptimal data further improves planning efficiency and robustness, highlighting the value of data diversity for verification-driven planning.

Abstract

We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation during training and inference to reduce train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, shaping the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans balancing feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves state-of-the-art 95% success, strongly outperforming prior methods that peak at 68%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
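
To make the inference procedure concrete, the following is a minimal sketch of "refine several horizon hypotheses by gradient descent, then keep the lowest-energy plan". It is an illustration under strong assumptions, not PaD's implementation: the energy here is a hand-written quadratic over a 1-D latent trajectory rather than a learned network, and the names `energy`, `energy_grad`, `refine`, `plan`, and `step_cost` are hypothetical.

```python
import random

def energy(traj, start, goal, step_cost=0.05):
    """Toy stand-in for a learned energy over a latent trajectory (assumption)."""
    pts = [start] + traj
    smooth = sum((b - a) ** 2 for a, b in zip(pts, pts[1:]))  # feasibility proxy
    reach = (traj[-1] - goal) ** 2                            # goal-consistency proxy
    return smooth + reach + step_cost * len(traj)             # longer plans cost more

def energy_grad(traj, start, goal):
    """Analytic gradient of the toy energy with respect to each trajectory point."""
    pts = [start] + traj
    grad = []
    for i in range(1, len(pts)):
        g = 2 * (pts[i] - pts[i - 1])
        if i + 1 < len(pts):
            g -= 2 * (pts[i + 1] - pts[i])
        else:
            g += 2 * (pts[i] - goal)
        grad.append(g)
    return grad

def refine(traj, start, goal, steps=300, lr=0.1):
    """Plain gradient descent on the trajectory itself."""
    for _ in range(steps):
        grad = energy_grad(traj, start, goal)
        traj = [z - lr * g for z, g in zip(traj, grad)]
    return traj

def plan(start, goal, horizons=(1, 2, 4, 8), seed=0):
    """Refine one noisy candidate per horizon hypothesis; return the lowest-energy one."""
    rng = random.Random(seed)
    best = None
    for h in horizons:
        init = [rng.gauss(0.0, 1.0) for _ in range(h)]  # noisy initial trajectory
        traj = refine(init, start, goal)
        e = energy(traj, start, goal)
        if best is None or e < best[0]:
            best = (e, h, traj)
    return best  # (energy, horizon, trajectory)

if __name__ == "__main__":
    for goal in (1.0, 3.0):
        e, h, _ = plan(0.0, goal)
        print(f"goal={goal}: selected horizon {h}, energy {e:.3f}")
```

With this toy energy, a nearby goal (distance 1) selects horizon 4 and a farther goal (distance 3) selects horizon 8, because the per-step cost trades off against the smoothness term. This mirrors, only qualitatively, the idea of choosing among time-to-reach hypotheses by energy.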

Paper Structure

This paper contains 50 sections, 12 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of the Planning as Descent (PaD) learning framework. Given trajectory states and a hindsight-relabeled goal $(s_g, \lambda)$, states are independently encoded into latent representations using $f_\theta$. Future latents are corrupted to form an initial trajectory $z_{\mathrm{future}}^{0}$, which is iteratively refined by descending the conditional energy $E_\theta$ and projecting updates back onto the encoder-induced manifold through $p_\theta$. At each refinement step, a denoising loss compares the intermediate trajectory to the clean future latents while stop-gradient operations prevent (i) mode collapse and (ii) backpropagation through the refinement dynamics.
  • Figure 2: Training loss on single-cube-noisy-v0 when ablating the manifold projector. The lightweight 130K-parameter projector substantially stabilizes training and improves convergence with negligible computational overhead.
  • Figure 3: Multi-hypothesis planning and temporal target selection in PaD. Left: As task execution progresses, PaD naturally selects time-to-reach hypotheses ($\lambda$) that correspond to decreasing distances to the goal, reflecting adaptive planning as the agent approaches completion. Right: Example timesteps during the rollout. The top row displays the corresponding environment states at each step. The bottom row shows the distribution of sampled temporal targets and their associated energies, with lower energies indicating more plausible plans for reaching the goal.
  • Figure 4: Cube-single tasks.
  • Figure 5: Episode-length distributions for cube-single tasks. Histogram of episode lengths over successful rollouts for PaD-play and PaD-noisy, aggregated across all five cube-single manipulation tasks. PaD-noisy exhibits a clear left-shifted distribution, indicating consistently shorter plans compared to PaD-play. This confirms that the efficiency gains observed in mean episode length are systematic and not driven by outliers.
  • ...and 1 more figures
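
The refinement loop described in the Figure 1 caption can be written compactly as follows. This is a sketch under assumed notation, not the paper's own equations: the step size $\eta$, the number of refinement steps $K$, the stop-gradient operator $\mathrm{sg}(\cdot)$, the symbol $z_{\mathrm{future}}^{\star}$ for the clean future latents, the squared-error form of the denoising loss, and the exact placement of the two stop-gradients are all assumptions made for illustration. Any conditioning of $E_\theta$ beyond the relabeled goal $(s_g, \lambda)$ is omitted for brevity.

$$
\begin{aligned}
\tilde z^{k} &= \mathrm{sg}\big(z_{\mathrm{future}}^{k}\big), \\
z_{\mathrm{future}}^{k+1} &= p_\theta\Big(\tilde z^{k} - \eta\,\nabla_{z} E_\theta\big(z \mid s_g, \lambda\big)\Big|_{z=\tilde z^{k}}\Big), \qquad k = 0,\dots,K-1, \\
\mathcal{L}(\theta) &= \sum_{k=1}^{K} \big\lVert z_{\mathrm{future}}^{k} - \mathrm{sg}\big(z_{\mathrm{future}}^{\star}\big) \big\rVert_2^2 .
\end{aligned}
$$

Under this reading, the stop-gradient on the target $z_{\mathrm{future}}^{\star}$ corresponds to item (i) in the caption (guarding against collapse of the encoder targets), and the stop-gradient on $z_{\mathrm{future}}^{k}$ between steps corresponds to item (ii) (no backpropagation through the refinement dynamics), so each step is trained as an independent one-step denoiser.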