Textual Planning with Explicit Latent Transitions

Eliezer Shlomi, Ido Levy, Eilam Shapira, Michael Katz, Guy Uziel, Segev Shlomov, Nir Mashkif, Roi Reichart, Sarah Keren

TL;DR

EmbedPlan replaces token-by-token LLM planning with a lightweight transition model operating in frozen language embeddings, predicting the next-state embedding $\

Abstract

Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates that the model generalizes to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain's transitions, while transfer across domain boundaries remains a bottleneck.
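The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the nearest-neighbor measure and uses toy 3-d vectors in place of real embeddings; the helper names `l2_normalize` and `retrieve_top_k` are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(pred, candidates, k=1):
    """Return indices of the k candidate embeddings most similar to `pred`.

    Assumption: similarity is cosine; the paper only says "nearest-neighbor similarity".
    """
    sims = l2_normalize(candidates) @ l2_normalize(pred)
    return np.argsort(-sims)[:k]

# Toy candidate pool: three "state embeddings" (stand-ins, not real encoder outputs).
candidates = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy predicted next-state embedding, closest to candidate 1.
pred = np.array([0.1, 0.9, 0.2])

top = retrieve_top_k(pred, candidates, k=2)
```

A Hit@k style metric then reduces to checking whether the index of the ground-truth next state appears in `top`.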

Paper Structure (117 sections, 28 equations, 5 figures, 25 tables)


Figures (5)

  • Figure 1: EmbedPlan: Latent Transition Learning. (A) Planning domains define state transitions as linguistic triplets $(s, a, s')$: the Blocksworld action pick-up(C) transforms the textual state description from "arm is empty, C is clear" to "arm is holding C." (B) EmbedPlan encodes states and actions via frozen LLM embeddings, then trains a lightweight transition network $T_\theta$ to predict next-state embeddings. Contrastive learning (InfoNCE) pulls predictions toward ground-truth states while pushing away negatives; inference retrieves the nearest candidate.
  • Figure 2: PCA of sampled transitions from three domains. States $s$ ($\circ$), predictions $\hat{s}'$ ($\triangle$), and ground-truth $s'$ ($\square$); dashed lines link each prediction to its source state. Interpolation, BGE-M3.
  • Figure 3: Extrapolation by Domain (Llama-3.3-70B). Hit@5 (%) under Problem-Grouped evaluation. Dashed line: untrained baseline (3.9%). Error bars: $\pm$ SE.
  • Figure 4: Complete EmbedPlan Architecture. State and action descriptions are encoded by a frozen LLM encoder $E$ into high-dimensional embeddings $\mathbf{z}_s, \mathbf{z}_a$. Learned projection heads $\pi_s, \pi_a$ reduce dimensionality to a shared 128-d space. The transition network $T_\theta$ (with residual connection from $\mathbf{h}_s$) predicts the next-state embedding $\hat{\mathbf{h}}_{s'}$, trained via InfoNCE to maximize similarity to the ground-truth embedding. At inference, the model retrieves the most similar state from a candidate pool.
  • Figure 5: Embedding Space Fragmentation Across Scales. PCA visualization of state embeddings colored by problem instance. (A) MPNet embeddings show tight, isolated clusters for each problem. (B) Llama-3.3-70B embeddings, despite being 700$\times$ larger, exhibit the same fragmentation. This confirms that pre-trained embeddings primarily cluster by problem-specific lexical features rather than abstract planning roles, regardless of model scale.
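
The pipeline in the Figure 1 and Figure 4 captions (frozen embeddings, projection heads $\pi_s, \pi_a$, a residual transition network $T_\theta$, and an InfoNCE objective) can be sketched as a forward pass in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: only the 128-d shared space comes from the captions, while the linear projection heads, the one-hidden-layer MLP, the in-batch negatives, the temperature of 0.1, the use of $\pi_s$ to project the target, and the random vectors standing in for encoder outputs are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# D_PROJ = 128 follows the Figure 4 caption; the other sizes are hypothetical.
D_ENC, D_PROJ, D_HID, BATCH = 32, 128, 64, 8

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

# Hypothetical parameters: linear projection heads and a small MLP for T_theta.
params = {
    "W_s": rng.normal(scale=0.1, size=(D_ENC, D_PROJ)),   # pi_s
    "W_a": rng.normal(scale=0.1, size=(D_ENC, D_PROJ)),   # pi_a
    "W1": rng.normal(scale=0.1, size=(2 * D_PROJ, D_HID)),
    "W2": rng.normal(scale=0.1, size=(D_HID, D_PROJ)),
}

def predict_next(params, z_s, z_a):
    """Project frozen embeddings, then predict the next-state embedding."""
    h_s = z_s @ params["W_s"]
    h_a = z_a @ params["W_a"]
    hidden = np.maximum(0.0, np.concatenate([h_s, h_a], axis=-1) @ params["W1"])
    return h_s + hidden @ params["W2"]  # residual connection from h_s

def info_nce(pred, target, temperature=0.1):
    """InfoNCE with in-batch negatives: row i's positive is target i."""
    logits = l2_normalize(pred) @ l2_normalize(target).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Random stand-ins for frozen encoder outputs z_s, z_a, z_{s'}.
z_s = rng.normal(size=(BATCH, D_ENC))
z_a = rng.normal(size=(BATCH, D_ENC))
z_next = rng.normal(size=(BATCH, D_ENC))

h_next = z_next @ params["W_s"]  # assumption: target uses the state head
pred = predict_next(params, z_s, z_a)
loss = info_nce(pred, h_next)
```

In a trained system the parameters would be updated by gradient descent on `loss` while the encoder stays frozen; the sketch only evaluates the objective once to show how the pieces fit together.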