
HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

Jaber Jaber, Osama Jaber

Abstract

World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture built on three interconnected principles: (1) object-centric decomposition via slot attention with spatial broadcast decoding; (2) hierarchical temporal dynamics through a three-level engine that combines selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals; and (3) causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol in which spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emergent spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers a 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm
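The "sequential PyTorch" baseline that the Triton kernel outperforms is, in essence, a per-timestep recurrence over a diagonal selective SSM. A minimal NumPy sketch of that reference recurrence is below; the shapes and gate names (`a`, `b`, `c`) are illustrative assumptions, not the paper's actual kernel interface:

```python
import numpy as np

def selective_scan(x, a, b, c):
    """Sequential reference for a diagonal selective SSM scan.

    x: (T, D) inputs; a, b: (T, D) input-dependent gates (hence
    'selective'); c: (T, D) output projections. Hypothetical layout.
    Recurrence: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t.
    The O(T) loop below is what a parallel-scan kernel replaces.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y
```

Because the recurrence is associative in `(a_t, b_t * x_t)` pairs, it admits a parallel prefix scan, which is what a fused Triton kernel can exploit for the reported speedup.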


Paper Structure

This paper contains 38 sections, 4 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: HCLSM architecture. Five layers process video into structured world states. The spatial broadcast decoder (SBD, dashed) provides the reconstruction signal that drives slot specialization during Stage 1 training.
  • Figure 2: Spatial broadcast decoder output. Top row: per-slot alpha heatmaps (ownership probability). Bottom row: slot overlays on input frame. Left: segmentation map (argmax over slots). Different colors indicate different slot assignments, showing emergent spatial decomposition.
  • Figure 3: Event detection across four episodes. Blue = event probability over time. Red dashed = detected event boundaries. The model learns to fire at moments of state transition (2--3 per episode).
  • Figure 4: Slot state trajectories projected to 2D via PCA. Circles = start, squares = end. Different colors = different slots. Trajectories show structured temporal dynamics with slot-specific evolution paths.
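The segmentation map in Figure 2 is derived from the per-slot alpha heatmaps: normalizing across slots gives each pixel's ownership probabilities, and the argmax assigns the pixel to one slot. A minimal sketch, assuming a `(K, H, W)` logits layout (an assumption; the paper's tensor layout is not given in this excerpt):

```python
import numpy as np

def slot_segmentation(alpha_logits):
    """Segmentation map from per-slot alpha logits, as in Figure 2.

    alpha_logits: (K, H, W) per-slot ownership logits (assumed layout).
    Softmax over the slot axis yields ownership probabilities; argmax
    assigns each pixel to the slot that claims it most strongly.
    """
    # Numerically stable softmax over the slot (K) axis.
    e = np.exp(alpha_logits - alpha_logits.max(axis=0, keepdims=True))
    alpha = e / e.sum(axis=0, keepdims=True)   # (K, H, W) probabilities
    return alpha.argmax(axis=0)                # (H, W) slot indices
```

Since softmax is monotonic per pixel, the argmax over probabilities matches the argmax over raw logits; the normalized alphas are still needed for the ownership heatmaps themselves.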