Self-Improving World Modelling with Latent Actions

Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

TL;DR

SWIRL presents a self-improving framework for intrinsic world modelling by treating actions as latent variables and iteratively optimising a Forward World Model and an Inverse Dynamics Model via Group Relative Policy Optimisation. The approach is grounded in a variational bound on the conditional mutual information and an ELBO, providing learnability guarantees for both phases. Empirically, SWIRL yields consistent gains across open-world visual dynamics, long-horizon prediction, and textual tool/interaction environments, often approaching or matching larger, more supervised models while using unlabelled state sequences. This demonstrates data-efficient self-improvement of internal world models with broad applicability to reasoning and planning in multimodal agents, though it also highlights safety considerations for autonomous web and tool interactions.

Abstract

Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the log-probability assigned by the other, frozen model as the reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
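The reward scheme in phase (1) can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the commonly used GRPO advantage (each reward normalised by the mean and standard deviation of its sampling group) and omits the KL term. The scorer `toy_idm_logprob` is a hypothetical stand-in for the frozen IDM's $\log Q_\phi(z \mid x, \hat{y})$, with scalar states and actions chosen purely for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_idm_logprob(z, x, y):
    """Hypothetical frozen IDM: unnormalised Gaussian log-score of action z
    given the transition x -> y, peaked where y - x equals z."""
    return -0.5 * (y - x - z) ** 2


def phase_one_advantages(x, z, candidates):
    """Score a group of FWM samples y_hat by the frozen IDM's log-probability
    of recovering the conditioning action z, then convert to advantages."""
    rewards = [toy_idm_logprob(z, x, y_hat) for y_hat in candidates]
    return rewards, group_relative_advantages(rewards)


# Assumed toy setup: prior state 0.0, latent action 2.0, four sampled next states.
rewards, advantages = phase_one_advantages(x=0.0, z=2.0, candidates=[2.0, 1.0, 4.0, 0.0])
best = max(range(len(advantages)), key=advantages.__getitem__)
print(rewards, best)
```

In this sketch, the candidate from which the action is most easily recovered receives the largest advantage. Phase (2) would mirror it with the roles swapped: the IDM samples actions, and the frozen FWM's log-probability of the observed next state serves as the reward.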

Paper Structure

This paper contains 47 sections, 4 theorems, 18 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.1

Optimising the FWM to maximise the log-probability assigned to generated samples by the frozen IDM maximises a variational lower bound on the Conditional Mutual Information $I_{\tilde{P}}(Z; \hat{Y} | X)$ defined over the empirical belief distribution $\tilde{P}(z|x)$.
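The shape of this bound can be sketched with the standard variational (Barber-Agakov style) argument. This is a hedged reconstruction under the notation above, not necessarily the paper's exact proof. Here $P(z \mid x, \hat{y})$ denotes the true posterior over actions induced by $\tilde{P}(z \mid x)$ and $P_\theta(\hat{y} \mid x, z)$, and all expectations are taken over $x$ from the data, $z \sim \tilde{P}(\cdot \mid x)$, and $\hat{y} \sim P_\theta(\cdot \mid x, z)$.

```latex
\begin{align*}
I_{\tilde{P}}(Z; \hat{Y} \mid X)
  &= H_{\tilde{P}}(Z \mid X) - H(Z \mid X, \hat{Y}) \\
  &= H_{\tilde{P}}(Z \mid X)
     + \mathbb{E}\big[ \log P(z \mid x, \hat{y}) \big] \\
  &= H_{\tilde{P}}(Z \mid X)
     + \mathbb{E}\big[ \log Q_\phi(z \mid x, \hat{y}) \big]
     + \mathbb{E}\Big[ \mathrm{KL}\big( P(\cdot \mid x, \hat{y}) \,\|\, Q_\phi(\cdot \mid x, \hat{y}) \big) \Big] \\
  &\ge H_{\tilde{P}}(Z \mid X)
     + \mathbb{E}\big[ \log Q_\phi(z \mid x, \hat{y}) \big].
\end{align*}
```

The inequality follows from the non-negativity of the KL divergence. Because $H_{\tilde{P}}(Z \mid X)$ does not depend on $\theta$, raising the expected frozen-IDM log-probability of FWM samples raises the lower bound.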

Figures (7)

  • Figure 1: In SWIRL (Self-improving World modelling with Iterative RL), we facilitate the world modelling ability of foundation models (LLMs and VLMs) by modelling two components: Forward World Model (FWM) $P_\theta(y \mid x, z)$ and Inverse Dynamics Model (IDM) $Q_\phi(z \mid x, y)$. These components are iteratively optimised through RL (specifically, GRPO) in two distinct phases: I) the FWM acts as a policy and the IDM as a reward to ensure identifiability between actions and next states; II) the IDM acts as a policy and the FWM as a reward to ensure data fidelity to the state-only sequences. The KL term is omitted from the figure for simplicity.
  • Figure 2: Performance of FWM and IDM in each iteration of SWIRL. We visualise the training dynamics across iterations for two settings: maintaining separate weights (Left) versus sharing parameters (Right). Evaluation performance is shown at the top and training rewards at the bottom. Only the first iteration's reward curves are shown for brevity.
  • Figure 3: We compare our proposed SWIRL against SFT baselines (continual training and merging) across five benchmarks in Aurora-Bench. The x-axis represents the number of training samples. Our method (solid blue line) demonstrates superior data efficiency, achieving higher GPT-4o evaluation scores.
  • Figure 4: Prompt template used to instruct GPT-4o to evaluate multi-turn visual dynamics prediction in the WorldPrediction benchmark.
  • Figure 5: Qualitative examples of SWIRL for general image editing, including adding, replacing, or removing a subject, style transfer, and altering the colour or background.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Theorem 3.1: FWM Lower Bound
  • proof
  • Theorem 3.2: IDM Lower Bound
  • proof
  • Theorem 1.1
  • proof
  • Theorem 1.2
  • proof