
Temporal Straightening for Latent Planning

Ying Wang, Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim G. J. Rudner, Yann LeCun, Mengye Ren

Abstract

Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant -- or even detrimental -- to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straightened latent trajectories, we jointly learn an encoder and a predictor. We show that reducing curvature this way makes the Euclidean distance in latent space a better proxy for the geodesic distance and improves the conditioning of the planning objective. We demonstrate empirically that temporal straightening makes gradient-based planning more stable and yields significantly higher success rates across a suite of goal-reaching tasks.
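As a rough illustration of what a curvature regularizer on latent trajectories might look like, the sketch below scores a trajectory by the cosine similarity between consecutive latent velocities. The velocity definition $v_t = z_{t+1} - z_t$, the `1 - cos` form of the penalty, and the function names are assumptions made for illustration; the abstract does not give the exact loss.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + eps)

def curvature_loss(traj):
    """Mean of 1 - cos(v_t, v_{t+1}) over a latent trajectory z_0..z_T.

    Assumed form of the penalty: v_t = z_{t+1} - z_t is the latent velocity,
    and a perfectly straight trajectory gives a loss close to 0.
    """
    if len(traj) < 3:
        raise ValueError("need at least three latent states")
    vels = [[b - a for a, b in zip(traj[t], traj[t + 1])]
            for t in range(len(traj) - 1)]
    terms = [1.0 - cosine(vels[t], vels[t + 1]) for t in range(len(vels) - 1)]
    return sum(terms) / len(terms)

# Toy 2-D latent trajectories (made-up values, not from the paper).
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]

loss_straight = curvature_loss(straight)
loss_zigzag = curvature_loss(zigzag)
```

On these toy inputs the straight trajectory scores near zero, while the zigzag, whose consecutive velocities are orthogonal, scores near one. In a training setup, a term of this kind would be added to the prediction loss with some weight.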

Paper Structure (53 sections, 4 theorems, 43 equations, 24 figures, 5 tables)

Key Result

Theorem 4.4

Under the linear-dynamics assumption (Assumption as:linear) with $d_a=d$ and $B$ invertible, unrolling the dynamics (eq:linear) shows that $z_K$ is affine in $\mathbf a$, so the planning Hessian $H$ takes a closed form. Let $\mathcal{W}_K:=J_\Phi J_\Phi^\top=\sum_{k=0}^{K-1}A^kBB^\top(A^\top)^k$ be the finite-horizon controllability Gramian [kailath1980linear, sontag1998mathematical, chen1999linear]. Then the effective condition number $\kappa_{\mathrm{eff}}(H):=\sigma_{
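The Gramian identity in the theorem can be checked numerically on a small example. The sketch below assumes the linear latent dynamics have the standard form $z_{t+1} = A z_t + B a_t$ (the excerpt does not show eq:linear, so this form is an assumption), takes $J_\Phi$ to be the Jacobian of $z_K$ with respect to the stacked actions, and uses made-up matrices. It confirms that $z_K$ is affine in the actions and that $J_\Phi J_\Phi^\top$ matches the Gramian sum.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_power(A, k):
    """A raised to the k-th power (k >= 0), starting from the identity."""
    n = len(A)
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = matmul(P, A)
    return P

def jacobian(A, B, K):
    """J = [A^{K-1} B, ..., A B, B]: derivative of z_K w.r.t. (a_0, ..., a_{K-1})."""
    blocks = [matmul(mat_power(A, K - 1 - k), B) for k in range(K)]
    return [sum((blk[i] for blk in blocks), []) for i in range(len(A))]

def gramian(A, B, K):
    """Finite-horizon Gramian: sum_{k=0}^{K-1} A^k B B^T (A^T)^k."""
    BBt = matmul(B, transpose(B))
    W = [[0.0] * len(A) for _ in A]
    for k in range(K):
        Ak = mat_power(A, k)
        W = add(W, matmul(matmul(Ak, BBt), transpose(Ak)))
    return W

def rollout(A, B, z0, actions):
    """Step the assumed dynamics z_{t+1} = A z_t + B a_t."""
    z = [[v] for v in z0]
    for a in actions:
        z = add(matmul(A, z), matmul(B, [[v] for v in a]))
    return [row[0] for row in z]

# Made-up example with d_a = d = 2 and B invertible.
A = [[0.9, 0.1], [0.0, 0.8]]
B = [[1.0, 0.0], [0.5, 1.0]]
K = 3
z0 = [1.0, -1.0]
actions = [[0.2, 0.0], [-0.1, 0.3], [0.05, -0.4]]

J = jacobian(A, B, K)
W = gramian(A, B, K)
JJt = matmul(J, transpose(J))

# Affine form: z_K = A^K z_0 + J a, with a the stacked action vector.
a_flat = [[v] for a in actions for v in a]
z_unrolled = [r[0] for r in add(matmul(mat_power(A, K), [[v] for v in z0]),
                                matmul(J, a_flat))]
z_stepwise = rollout(A, B, z0, actions)
```

Here `JJt` and `W` agree entry by entry, and `z_unrolled` matches `z_stepwise`, which is the affine dependence on $\mathbf a$ that the theorem relies on.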

Figures (24)

  • Figure 1: Latent trajectories encoded by a pretrained visual encoder are usually highly curved, increasing the difficulty of prediction and planning. We learn a representation space where feasible trajectories are straighter.
  • Figure 2: Latent trajectories before vs. after straightening. The upper PushT example is a rotation, and the lower UMaze example shows the agent traveling from the top left to the top right, with the star denoting the target. Straightening yields smoother, less curved trajectories and makes Euclidean distance a more faithful proxy for geodesic progress toward the goal. More examples are shown in the appendix (app:pca).
  • Figure 3: Illustration of Training and Planning. During training, we minimize the prediction loss between the predicted embedding $\hat{z}_{t}^t$ and the target ${z}_{t}^t$, with stop-gradient in the target branch, and we also minimize the local curvature of the embeddings. During planning, we roll out the trained predictor over the horizon $T$ and select the actions that minimize the cost between the predicted terminal state and the target in the embedding space.
  • Figure 4: Action-Space Loss Landscape. We pick one test sample from PushT with a planning horizon of 25 steps. For each coordinate $(a_x, a_y)$ in the grid, we fix the first action and optimize the remaining action sequence in the planning horizon to minimize the terminal goal cost. The heatmap represents the minimum attainable loss for each initial action choice, with darker colors indicating lower loss. The loss landscape is closer to convex after straightening.
  • Figure 5: Latent Curvature and Open-Loop GD Success Rate for Different Encoders. Higher cosine similarity indicates lower curvature. Here, we compare models with spatial features and report the average patch-wise cosine similarity. Given the same type of encoder, reduced curvature generally leads to higher success rates.
  • ...and 19 more figures
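The planning procedure described in the Figure 3 caption (roll out the predictor over the horizon, then pick actions that minimize the terminal cost in embedding space) can be sketched as plain gradient descent over an action sequence. Everything specific here is an assumption for illustration: `predictor` is a hypothetical stand-in for the learned model, the cost is taken to be the squared Euclidean distance to the goal embedding, and central finite differences replace the automatic differentiation a real implementation would likely use.

```python
def predictor(z, a):
    """Hypothetical stand-in for a learned latent predictor: z' = z + 0.5 * a."""
    return [zi + 0.5 * ai for zi, ai in zip(z, a)]

def terminal_cost(z0, actions, goal):
    """Roll out the predictor and return the squared distance to the goal."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))

def numerical_grad(z0, actions, goal, eps=1e-5):
    """Central finite-difference gradient of the terminal cost w.r.t. each action."""
    grad = [[0.0] * len(a) for a in actions]
    for t, a in enumerate(actions):
        for i in range(len(a)):
            original = a[i]
            a[i] = original + eps
            up = terminal_cost(z0, actions, goal)
            a[i] = original - eps
            down = terminal_cost(z0, actions, goal)
            a[i] = original  # restore before moving to the next coordinate
            grad[t][i] = (up - down) / (2 * eps)
    return grad

def plan(z0, goal, horizon, steps=100, lr=0.1):
    """Open-loop gradient descent over the whole action sequence."""
    actions = [[0.0] * len(z0) for _ in range(horizon)]
    for _ in range(steps):
        grad = numerical_grad(z0, actions, goal)
        actions = [[ai - lr * gi for ai, gi in zip(a, g)]
                   for a, g in zip(actions, grad)]
    return actions

# Toy problem with made-up start and goal embeddings.
z0, goal = [0.0, 0.0], [1.0, -1.0]
initial_cost = terminal_cost(z0, [[0.0, 0.0] for _ in range(3)], goal)
planned = plan(z0, goal, horizon=3)
final_cost = terminal_cost(z0, planned, goal)
```

With this toy predictor the cost is a well-conditioned quadratic in the actions, so gradient descent drives `final_cost` close to zero. A highly curved latent space would correspond to a much less benign landscape, which is the situation the loss-landscape figure contrasts with the straightened case.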

Theorems & Definitions (15)

  • Definition 4.2: $\varepsilon$-straight transition
  • Remark 4.3: Cosine similarity as a practical proxy
  • Theorem 4.4: Conditioning of the planning Hessian
  • Remark 4.5: Low-dimensional actions
  • Definition C.2: Effective condition number
  • Definition C.3: $\varepsilon$-straight transition
  • Lemma C.4: Hessian form and Gramian equivalence
  • proof
  • Theorem C.5: Conditioning bound
  • proof
  • ...and 5 more