From Next Token Prediction to (STRIPS) World Models -- Preliminary Results

Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner

TL;DR

This work investigates whether transformer-based next-token prediction can yield accurate, interpretable world models in propositional STRIPS domains by learning from action traces alone. It introduces the STRIPS Transformer, a differentiable architecture that mimics B-RASP computations and can recover an exact STRIPS model $M$ from labeled traces, yielding a symbolic model $M_{\bar{\theta}}$ when binarized. In the reported experiments, given sufficient and diverse training traces, the learned domain matches the ground-truth domain and generalizes to longer, unseen traces across multiple domains. The study bridges symbolic planning and neural sequence models, offering a pathway to learning domain-independent planning models from data, and highlights interpretability and potential extensions to lifted STRIPS domains and multimodal inputs.

Abstract

We consider the problem of learning propositional STRIPS world models from action traces alone, using a deep learning architecture (transformers) and gradient descent. The task is cast as a supervised next token prediction problem where the tokens are the actions, and an action $a$ may follow an action sequence if the hidden effects of the previous actions do not make an action precondition of $a$ false. We show that a suitable transformer architecture can faithfully represent propositional STRIPS world models, and that the models can be learned from sets of random valid (positive) and invalid (negative) action sequences alone. A number of experiments are reported.
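The validity condition described in the abstract can be illustrated with a minimal Python sketch. It assumes that a precondition counts as false only when the most recent earlier action affecting it deleted it, and that a trace is positive when every action in it is applicable. The `Action` class, the function `is_positive_trace`, and the three-action toy domain are hypothetical names and data chosen for illustration; the toy domain is not the paper's $\texttt{simple}$ domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A propositional STRIPS action (hypothetical container for this sketch)."""
    pre: frozenset = frozenset()
    add: frozenset = frozenset()
    delete: frozenset = frozenset()


def is_positive_trace(domain, trace):
    """Return True if no action in the trace has a precondition made false earlier.

    Assumption: propositions not deleted by any earlier action are not false.
    """
    false_props = set()  # propositions currently known to be false
    for name in trace:
        act = domain[name]
        if act.pre & false_props:
            return False  # some precondition was deleted and not re-added
        # Apply effects: deletes make propositions false, adds make them true.
        false_props = (false_props | act.delete) - act.add
    return True


# Hypothetical toy domain, for illustration only.
toy_domain = {
    "a": Action(pre=frozenset({"p"}), delete=frozenset({"p"})),
    "b": Action(pre=frozenset({"q"}), add=frozenset({"p"}), delete=frozenset({"q"})),
    "c": Action(add=frozenset({"q"})),
}

print(is_positive_trace(toy_domain, ["a", "c", "b", "a"]))  # b re-adds p before the second a
print(is_positive_trace(toy_domain, ["a", "c", "a"]))       # second a finds p deleted
```

In this reading, the learner never sees the propositions or the effects; it only sees action sequences and their positive or negative labels, which is what makes the effects "hidden".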

Paper Structure

This paper contains 19 sections, 3 theorems, 10 equations, 3 figures, 1 table.

Key Result

Theorem 3

Let $\tau$ be an action sequence drawn from $M$. Then, $f_M^{\text{B-RASP}}(\tau) = f_M(\tau)$.

Figures (3)

  • Figure 1: Vectors produced by the B-RASP program $f_M^{\text{B-RASP}}$ for the two traces $\tau^+$ and $\tau^-$ shown on top, drawn from the model $M$ shown on the left, named $\texttt{simple}$. The value of $f_M^{\text{B-RASP}}(\tau)$ is given by the last entry of the last vector, i.e., $Z(n=6)$ (marked in bold), and for both traces, $f_M^{\text{B-RASP}}(\tau)=f_M(\tau)$. (a) The hidden STRIPS domain $\texttt{simple}$. (b) Vectors produced for computing $f_M^{\text{B-RASP}}(\tau^+)$. (c) Vectors produced for computing $f_M^{\text{B-RASP}}(\tau^-)$. The trace $\tau^-$ contains two inapplicable actions (marked in red): the second occurrence of $\texttt{a}$, for which the precondition $p$ is false (see how $Y_p(3)=1$), and the last occurrence of $\texttt{b}$, for which the preconditions $q$ and $r$ are false (see how $Y_q(6)=Y_r(6)=1$). Finally, we observe that $Y(3)=1$ and $Y(6)=1$, meaning that $a_3=\texttt{a}$ and $a_6=\texttt{b}$ are inapplicable actions, so $\tau^-$ is a negative trace ($Y(6)=1$).
  • Figure 2: Examples of domains learned in $\texttt{simple}$ for the training dataset with 200 samples. For comparison purposes, we renamed the propositions to p, q, r as in Figure 1. Incorrect preconditions, add effects and delete effects are highlighted in color: red for excess propositions and gray for those that are absent. (a) shows the domain learned by the STRIPS Transformer in 9 out of 10 seeds, which is equivalent to the hidden, ground-truth domain. As a result, it obtains both perfect training and test accuracy. (b) shows the domain learned for seed 5. This domain contains several incorrect preconditions and effects, thus resulting in 80% training accuracy and 83% test accuracy.
  • Figure 3: Self-attention computations for (a) the valid trace $(\texttt{a},\texttt{c},\texttt{c},\texttt{b},\texttt{c},\texttt{a})$ and (b) the invalid trace $(\texttt{a},\texttt{c},\texttt{a},\texttt{c},\texttt{b},\texttt{b})$ in the STRIPS domain $\texttt{simple}$ (see Figure 1) using the optimal parameters $\theta^*$. Each attention head $\text{att}_{p_l}$, where $p_l \in \{\texttt{p},\texttt{q},\texttt{r}\}$, shows the score matrix $S_{p_l}=Q_{p_l} \cdot K_{p_l}^\top$ and the strict future masking operation (blank cells are masked). Since every score $S_{p_l}(i,j)$ is either 0 or 1, after applying the stick-breaking normalization each row $i$ contains a single normalized score $S'_{p_l}(i,j)$ equal to one (highlighted in gray), whereas the rest are set to zero (non-gray cells). Therefore, when scores are 0 or 1, stick-breaking attention is equivalent to hard attention. Finally, we show the output $y_{p_l}=S'_{p_l} \cdot V_{p_l}$ of each attention head, where $y_{p_l}(i) = 0$ if $a_i$ is applicable according to $p_l$ and otherwise $y_{p_l}(i) = 1$. For the invalid trace $(\texttt{a},\texttt{c},\texttt{a},\texttt{c},\texttt{b},\texttt{b})$ in (b) we can see that $y_\texttt{p}(3)=1$, meaning that $a_3=\texttt{a}$ is inapplicable according to $\texttt{p}$; also, we observe that $y_\texttt{q}(6)=y_\texttt{r}(6)=1$, meaning that $a_6=\texttt{b}$ is inapplicable according to both $\texttt{q}$ and $\texttt{r}$. Therefore, the trace is negative.
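
The equivalence stated in the Figure 3 caption, that stick-breaking attention reduces to hard attention when scores are 0 or 1, can be checked with a small Python sketch. It rests on two assumptions not confirmed by the text above: the scores are used directly as break probabilities $\beta_{i,j}$, and the stick is broken starting from the nearest earlier position, so that the weight is $\beta_{i,j} \prod_{j<k<i} (1-\beta_{i,k})$. The function names and the example matrix are hypothetical.

```python
def stick_breaking_weights(beta):
    """Stick-breaking weights under strict future masking (only j < i is used).

    Assumption: beta[i][j] in [0, 1] is used directly as the break probability.
    """
    n = len(beta)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        remaining = 1.0  # portion of the stick not yet allocated
        for j in range(i - 1, -1, -1):  # nearest earlier position first
            weights[i][j] = beta[i][j] * remaining
            remaining *= 1.0 - beta[i][j]
    return weights


def hard_weights(beta):
    """Hard attention: weight 1 on the nearest earlier j with beta[i][j] == 1."""
    n = len(beta)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i - 1, -1, -1):
            if beta[i][j] == 1:
                weights[i][j] = 1.0
                break
    return weights


# Hypothetical binary score matrix; entries with j >= i are ignored (masked).
binary_scores = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]

print(stick_breaking_weights(binary_scores) == hard_weights(binary_scores))
```

In this sketch a row with no earlier position scoring 1 (such as the first row) receives all-zero weights; how the architecture treats that case is not stated in the caption.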

Theorems & Definitions (6)

  • Definition 1: Positive and negative traces
  • Definition 2: Learning task
  • Theorem 3
  • Theorem 4: From STRIPS to Transformer
  • Definition 5
  • Theorem 6: From Transformer to STRIPS