Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

Lance Legel; Qin Huang; Brandon Voelker; Daniel Neamati; Patrick Alan Johnson; Favyen Bastani; Jeff Rose; James Ryan Hennessy; Robert Guralnick; Douglas Soltis; Pamela Soltis; Shaowen Wang

TL;DR

This paper presents DeepEarth, a self-supervised multi-modal world model built on Earth4D, a novel planetary-scale 4D space-time positional encoder, and demonstrates Earth4D's expressive power with state-of-the-art performance on an ecological forecasting benchmark.

Abstract

We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
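To make the idea of extending multi-resolution hash encoding to space-time concrete, here is a minimal sketch of a decomposed encoder. It projects each normalized (x, y, z, t) coordinate onto the four 3D grids named in the Figure 2 caption (xyz, xyt, yzt, xzt), looks up one feature vector per grid and resolution level in a hashed table, and concatenates the results. This is not the DeepEarth implementation. The class name, the table size, the base resolution, the doubling of resolution per level, the hash primes (borrowed from Instant-NGP-style encoders), and the nearest-vertex lookup without interpolation are all assumptions chosen for illustration, and learned hash probing is not modeled.

```python
import random

# Hypothetical names throughout: an illustrative sketch, not the Earth4D code.
GRIDS = {"xyz": (0, 1, 2), "xyt": (0, 1, 3), "yzt": (1, 2, 3), "xzt": (0, 2, 3)}
PRIMES = (1, 2654435761, 805459861)  # assumed Instant-NGP-style spatial-hash primes


class ToySpaceTimeEncoder:
    def __init__(self, n_levels=24, table_size=2**10, feat_dim=2, base_res=4, seed=0):
        rng = random.Random(seed)
        self.n_levels, self.table_size = n_levels, table_size
        self.feat_dim, self.base_res = feat_dim, base_res
        # One small feature table per (grid, level). In a real model these
        # entries would be trainable parameters, not fixed random numbers.
        self.tables = {
            name: [
                [[rng.gauss(0.0, 0.01) for _ in range(feat_dim)]
                 for _ in range(table_size)]
                for _ in range(n_levels)
            ]
            for name in GRIDS
        }

    def _slot(self, cell):
        # XOR of prime-scaled integer cell indices, folded into the table size.
        h = 0
        for c, p in zip(cell, PRIMES):
            h ^= c * p
        return h % self.table_size

    def encode(self, coord):
        """coord: (x, y, z, t), each already normalized to [0, 1)."""
        out = []
        for name, axes in GRIDS.items():
            sub = [coord[a] for a in axes]  # 3D projection of the 4D point
            for level in range(self.n_levels):
                res = self.base_res * 2**level  # assumed: resolution doubles per level
                cell = [int(v * res) for v in sub]  # lower grid vertex, no interpolation
                out.extend(self.tables[name][level][self._slot(cell)])
        return out


encoder = ToySpaceTimeEncoder()
embedding = encoder.encode((0.1, 0.2, 0.3, 0.4))
print(len(embedding))  # 4 grids x 24 levels x 2 features
```

With the defaults above, the embedding length matches the 192D output stated in the Figure 4 caption. The table size here is deliberately tiny so the sketch runs quickly.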

Paper Structure (10 sections, 5 figures, 1 table)

DeepEarth Architecture
Earth4D Architecture
Earth4D Experimental Validation
Live Fuel Moisture Content Prediction
Earth4D Resolution Specifications
Learned Hash Probing and Ablation Studies
Hash Collision Simulations
Performance Improvements
Benchmark Specifications
Galileo Baseline Model

Figures (5)

Figure 1: DeepEarth Overview. Masked multi-modal data (e.g. images, text) sampled around an event (e.g. pollination) are encoded and fused with Earth4D space-time embeddings. These universal tokens are jointly encoded, and then masked data is inductively decoded and simulated.
Figure 2: Earth4D Space-Time Positional Encoding. A planetary-scale 4D encoder with fully decomposable spatio-temporal representation. Four grids (xyz, xyt, yzt, xzt) are each learned in 3D space and computed in parallel. Each grid has multiple resolution levels (see the appendix "Earth4D Resolution Specifications"), enabling deep learning of complex joint distributions in multi-modal data across space-time scales.
Figure 3: Earth4D LFMC Prediction Performance. (Top) Distribution of absolute errors in percentage point predictions across 13,297 test samples, showing median error of 7.1pp. (A) Geographic error distribution across CONUS shows low error in well-sampled regions. (B) Temporal predictions closely track ground truth LFMC measurements across seasons (2017–2023).
Figure 4: Earth4D Space-Time Scales. Default 24$\times$24$\times$24 levels for each xyz, xyt, yzt, xzt grid. Each level stores up to $2^{22}$ entries, with each entry storing a 2D feature. Requires 724M trainable parameters ($\sim$11 GB GPU memory during training). Parallelizable across levels and spatio-temporal boundaries. Outputs 192D per $(x,y,z,t)$ coordinate from 4 grids $\times$ 24 levels $\times$ 2D feature per level. Hashing saves memory vs. the naive dense-grid requirement, e.g., $(2^{28})^3 = 2^{84} \approx 1.9 \times 10^{25}$ cells at level 24.
Figure 5: Earth4D Hash Collision Analysis. (Table) 10 $(x,y,z,t)$ point distribution scenarios that were simulated to analyze hash collisions in Earth4D memory. (Graph) Shows results for 1M point simulations across all 24 levels.
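
The sizes quoted in the Figure 4 caption can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated there (4 grids, 24 levels, up to $2^{22}$ entries per level, 2 features per entry, and a $2^{28}$ per-axis resolution at level 24). The variable names are illustrative, and the parameter count is an upper bound that assumes every level fills its table.

```python
# Figures taken from the Figure 4 caption; variable names are illustrative.
N_GRIDS = 4                      # xyz, xyt, yzt, xzt
N_LEVELS = 24                    # resolution levels per grid
MAX_ENTRIES_PER_LEVEL = 2**22    # "up to" this many hash-table entries
FEATURES_PER_ENTRY = 2           # 2D feature per entry

# Embedding width per (x, y, z, t) coordinate.
output_dim = N_GRIDS * N_LEVELS * FEATURES_PER_ENTRY

# Upper bound on trainable parameters if every level used a full table.
param_upper_bound = N_GRIDS * N_LEVELS * MAX_ENTRIES_PER_LEVEL * FEATURES_PER_ENTRY

# Cells in a dense (unhashed) 3D grid at the finest level.
dense_cells_level_24 = (2**28) ** 3

print(output_dim)                       # 192
print(param_upper_bound)                # 805306368
print(f"{dense_cells_level_24:.2e}")    # 1.93e+25
```

The upper bound of about 805M is consistent with the reported 724M trainable parameters if coarse levels need fewer than $2^{22}$ entries, which is one plausible reading of "up to" in the caption. A dense grid at level 24 would need roughly $4.6 \times 10^{18}$ times more cells than a single $2^{22}$-entry table.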