Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting

Yuri Calleo

TL;DR

The paper tackles the mismatch between rigorous geostatistical modeling and scalable deep learning for spatio-temporal forecasting by introducing a Spatially-Informed Transformer. It injects a differentiable geostatistical covariance bias, via a learnable Matérn kernel, into the self-attention mechanism, yielding a Geostatistical Attention that decomposes into a stationary prior and a non-stationary neural residual. Through Deep Variography, the model learns the underlying spatial range end-to-end and achieves superior predictive accuracy and well-calibrated probabilistic forecasts on synthetic GRFs and traffic data, while providing interpretable spatial structure in attention maps. The work also presents a robust statistical validation framework (DM test, PIT, Moran’s I) and discusses limitations (notably $O(N^2)$ complexity and stationarity assumptions) with actionable directions for future research, including linear-attention approximations and anisotropic/non-stationary extensions.

Abstract

The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
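To make the idea of a covariance bias inside self-attention concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the prior enters as an additive term on the attention logits, that the kernel is a unit-variance Matérn with smoothness 3/2 (a standard closed form), and that the range `rho` and mixing weight `lam` are plain floats. The names `matern32`, `geo_attention`, and `lam` are hypothetical, and the exact form of the paper's Eq. 8 may differ. In the paper the range is learned by backpropagation; here it is fixed, since NumPy provides no gradients.

```python
import numpy as np

def matern32(dist, rho):
    """Unit-variance Matérn covariance with smoothness nu = 3/2 and range rho."""
    scaled = np.sqrt(3.0) * dist / rho
    return (1.0 + scaled) * np.exp(-scaled)

def geo_attention(Q, K, V, coords, rho, lam=1.0):
    """Scaled dot-product attention plus an additive Matérn bias on the logits.

    Assumption: the stationary prior is added to the data-driven logits;
    lam is a hypothetical weight controlling the strength of the prior.
    """
    d_k = Q.shape[-1]
    # Pairwise Euclidean distances between sensor locations, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Data-driven term (non-stationary) + geostatistical prior (stationary).
    logits = Q @ K.T / np.sqrt(d_k) + lam * matern32(dist, rho)
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 6, 4
coords = rng.uniform(0.0, 1.0, size=(N, 2))  # synthetic sensor coordinates
Q = np.zeros((N, d))  # zero queries and keys isolate the effect of the prior
K = np.zeros((N, d))
V = rng.normal(size=(N, d))
out, attn = geo_attention(Q, K, V, coords, rho=0.3, lam=2.0)
```

With the data-driven term zeroed out, each row of `attn` depends only on distance, so every sensor attends most strongly to itself and progressively less to more distant sensors. This is the soft, distance-aware constraint the abstract describes.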

Paper Structure

This paper contains 20 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The Geostatistical Attention Mechanism. Visualization of Eq. 8: combining data-driven terms (left) with the geostatistical prior (right).
  • Figure 2: End-to-end learning of spatial covariance. Evolution of the learned range parameter $\hat{\rho}$ (blue line) during training. The parameter converges asymptotically to the true physical range (red dashed line), demonstrating parameter recoverability.
  • Figure 3: Learned Attention Structures. (Left) Standard Self-Attention learns noisy, long-range correlations symptomatic of overfitting. (Right) Geostatistical Attention enforces a smooth, topology-aware prior consistent with the underlying Gaussian Random Field.
  • Figure 4: Spatial Residual Analysis. The Vanilla Transformer (right) shows clustered errors (high Moran's I), indicating a failure to capture spatial dependencies. The Geo-Transformer (left) achieves spatial whitening (Moran's I $\approx$ 0), effectively removing autocorrelation from the residuals.
  • Figure 5: Probabilistic forecasting performance. One-step-ahead prediction for Sensor #201. The model captures the temporal dynamics accurately, with confidence intervals that reflect the true local variance.
  • ...and 2 more figures
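
The Moran's I statistic cited in the Figure 4 caption measures spatial autocorrelation in residuals. As a rough illustration, the sketch below computes the standard global Moran's I for a toy chain of six sites with binary neighbor weights. The function name `morans_i`, the chain layout, and the example values are assumptions for illustration only and are not taken from the paper's experiments.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: (N / sum(w)) * (z^T W z) / (z^T z), z = centered values."""
    z = values - values.mean()
    return len(values) / weights.sum() * (z @ weights @ z) / (z @ z)

# Toy spatial layout: six sites on a line, each linked to its direct neighbors.
n_sites = 6
W = np.eye(n_sites, k=1) + np.eye(n_sites, k=-1)

smooth = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # clustered pattern
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # checkerboard pattern

i_smooth = morans_i(smooth, W)        # positive: neighbors are similar
i_alt = morans_i(alternating, W)      # negative: neighbors are dissimilar
```

Values near zero indicate residuals with no remaining spatial structure, which is the "spatial whitening" the caption refers to. Clearly positive values indicate clustered errors.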