FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models

Fabien Polly

Abstract

World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single-step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and critically maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results establish that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.
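
To make the idea of "PDE integration as the predictor" concrete, here is a minimal sketch of explicit-Euler integration of a reaction-diffusion equation, du/dt = D ∇²u + R(u), on a 16x16 grid. It is an illustration under stated assumptions, not the paper's implementation: the periodic boundaries, the step size `dt`, the step count, and the `reaction` function (a fixed pointwise nonlinearity standing in for whatever learned reaction term the model uses) are all hypothetical.

```python
import numpy as np

def laplacian_2d(u):
    """5-point Laplacian; periodic boundaries are an assumption of this sketch."""
    return (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0)
            + np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4.0 * u)

def reaction(u):
    """Hypothetical pointwise reaction term, standing in for a learned one."""
    return np.tanh(u) - u

def integrate(u, steps=8, dt=0.1, diffusion=1.0):
    """Explicit Euler integration of du/dt = D * lap(u) + R(u).

    Each step touches every grid cell a constant number of times,
    so the cost per step is linear in the number of cells.
    """
    for _ in range(steps):
        u = u + dt * (diffusion * laplacian_2d(u) + reaction(u))
    return u

rng = np.random.default_rng(0)
state = rng.standard_normal((16, 16))   # stand-in for a latent state
predicted = integrate(state)            # integrated state plays the role of the prediction
```

With `dt * diffusion` at or below 0.25, the explicit scheme stays stable on a 2D grid; a larger step would need an implicit or sub-stepped integrator.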

Paper Structure (85 sections, 10 equations, 28 figures, 7 tables, 1 algorithm)


Figures (28)

  • Figure 1: FluidWorld architecture overview. Top row: The prediction pipeline. An input frame $x_t$ (64$\times$64 pixels) is encoded by three PDE-based layers that use Laplacian diffusion to extract spatial features. These features are written into a persistent BeliefField, a 16$\times$16 latent state that accumulates temporal context. The BeliefField evolves via internal PDE dynamics to predict the next state, which the decoder converts back to pixels. During autoregressive rollout, each prediction becomes the next input (purple arrow). Bottom row: Key components. The Laplacian kernel ($[1, -2, 1]$) provides spatial information propagation at $O(N)$ cost without attention. The BeliefField combines GRU-gated writing with PDE evolution and Titans persistent memory. Biologically-inspired mechanisms (lateral inhibition, Hebbian diffusion, synaptic fatigue) promote diverse, structured representations. The entire model uses ${\sim}$801K parameters (Table \ref{tab:params}).
  • Figure 2: How Laplacian diffusion enables self-repair. Top row (a--d): The heat equation analogy. A concentrated energy spike (a) spreads via diffusion (b) until reaching equilibrium (c). The Laplacian kernel $[1, -2, 1]$ (d) computes the difference between a position and its neighbors; this single operation drives all spatial propagation in FluidWorld. Bottom row (e--h): Application to prediction errors. A clean prediction (e) accumulates noise during autoregressive rollout (f). The Laplacian smooths away high-frequency errors (g), recovering coherent spatial structure. Panel (h) summarizes why this matters: Transformers have no diffusion mechanism (errors compound), ConvLSTMs have only local kernels (slow dissipation), but FluidWorld's Laplacian provides global error correction at every integration step.
  • Figure 3: Training dynamics comparison (three-way). (a) Reconstruction loss: PDE and ConvLSTM converge to $2\times$ lower error than Transformer. (b) Prediction loss: all three converge to comparable values. (c) Spatial Std: PDE maintains the highest spatial structure throughout training. (d) Effective Rank: PDE uses the most representational dimensions. All models have $\sim$800K parameters, identical data and losses.
  • Figure 4: Final comparison at step 8,000. Single-step metrics are comparable across all three models ($\sim$800K parameters each). The PDE achieves the highest spatial structure and effective dimensionality.
  • Figure 5: PDE diffusion scales $O(N)$; attention scales $O(N^2)$. At $128\!\times\!128$ the ratio exceeds $16{,}000\times$.
  • ...and 23 more figures
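
The heat-equation analogy in Figure 2 and the scaling comparison in Figure 5 can be sketched in a few lines. This is an illustrative example, not code from the paper: the grid length, step size `dt`, step count, and zero-padded edges are assumptions, and `cost_ratio` assumes one position per pixel and counts only N² pairwise interactions against N stencil updates, ignoring constant factors.

```python
import numpy as np

KERNEL = np.array([1.0, -2.0, 1.0])  # 1D Laplacian stencil quoted in the captions

def diffuse_1d(u, steps, dt=0.25):
    """Explicit heat-equation steps: u <- u + dt * (KERNEL convolved with u).

    Edges are zero-padded (an assumption); dt <= 0.5 keeps the scheme stable.
    """
    for _ in range(steps):
        u = u + dt * np.convolve(u, KERNEL, mode="same")
    return u

def cost_ratio(height, width):
    """Ratio of pairwise (attention-like) to per-position (stencil) operation counts."""
    n = height * width
    attention_ops = n * n   # every position interacts with every other position
    diffusion_ops = n       # every position is updated once from its neighbors
    return attention_ops // diffusion_ops

# A concentrated spike spreads out while its total "energy" is conserved.
spike = np.zeros(33)
spike[16] = 1.0
spread = diffuse_1d(spike, steps=10)

ratio = cost_ratio(128, 128)  # 16384 under the counting assumption above
```

Under this simple count, the ratio at 128x128 equals N = 16,384, which is in line with the "exceeds 16,000x" figure in the caption of Figure 5.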