Multi-modal Spatio-Temporal Transformer for High-resolution Land Subsidence Prediction

Wendong Yao, Binhua Huang, Soumyabrata Dev

TL;DR

This work tackles the challenge of forecasting high-resolution land subsidence by integrating multi-modal data (dynamic displacement, static physical priors, and temporal cycles) into a joint spatio-temporal Transformer. The proposed MM-STT employs a unified spatio-temporal attention mechanism to fuse modalities and model long-range dependencies, achieving state-of-the-art performance on the EGMS dataset with near-perfect R^2 and substantial RMSE improvements at long horizons. Key contributions include the multi-modal forecasting paradigm, the joint attention architecture, and extensive generalization tests across diverse deformation regimes, demonstrating robust, physically plausible forecasts. The findings underscore the critical importance of deep multi-modal fusion for geophysical forecasting and point to practical applications in infrastructure monitoring and hazard mitigation.

Abstract

Forecasting high-resolution land subsidence is a critical yet challenging task due to its complex, non-linear dynamics. While standard architectures like ConvLSTM often fail to model long-range dependencies, we argue that a more fundamental limitation of prior work lies in the uni-modal data paradigm. To address this, we propose the Multi-Modal Spatio-Temporal Transformer (MM-STT), a novel framework that fuses dynamic displacement data with static physical priors. Its core innovation is a joint spatio-temporal attention mechanism that processes all multi-modal features in a unified manner. On the public EGMS dataset, MM-STT establishes a new state of the art, reducing the long-range forecast RMSE by an order of magnitude compared to all baselines, including SOTA methods like STGCN and STAEformer. Our results demonstrate that for this class of problems, an architecture's inherent capacity for deep multi-modal fusion is paramount for achieving transformative performance.
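To make the idea of joint spatio-temporal attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name, the `(T, N, C)` token layout, the single attention head, and the random projection matrices are all assumptions chosen for illustration. The only point it shows is that every (time step, spatial patch) pair becomes one token, so a single attention matrix covers spatial and temporal dependencies together instead of handling them in separate stages.

```python
import numpy as np

def joint_spatiotemporal_attention(cube, w_q, w_k, w_v):
    """Single-head self-attention over all (time, patch) tokens at once.

    cube: array of shape (T, N, C) with T time steps, N spatial patches and
          C fused multi-modal channels (hypothetical layout, not from the paper).
    w_q, w_k, w_v: projection matrices of shape (C, D).
    Returns the attended features (T, N, D) and the weights (T*N, T*N).
    """
    T, N, C = cube.shape
    tokens = cube.reshape(T * N, C)               # one token per (time, patch) pair
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
    out = weights @ v
    return out.reshape(T, N, -1), weights

# Toy sizes, chosen only for the example.
rng = np.random.default_rng(0)
T, N, C, D = 4, 9, 5, 8
cube = rng.normal(size=(T, N, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) for _ in range(3))

out, weights = joint_spatiotemporal_attention(cube, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (4, 9, 8) (36, 36)
```

Each row of `weights` spans all `T * N` tokens, so a patch at one time step can attend directly to any patch at any other time step. The cost is quadratic in `T * N`, which is the usual trade-off of attending jointly rather than factorizing space and time.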

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the MM-STT pipeline, which has two parts: data processing and model training. The light gray block shows how the original CSV-formatted EGMS data is processed: Feature Extraction, Spatial Rasterization, and Temporal Feature Engineering run as a pipeline to extract both static and dynamic data. The extracted data is then combined by Multi-Channel Data Fusion into an Input Sequence Multi-modal Data Cube. The data loader feeds this cube to the spatio-temporal Transformer for training and prediction. The model outputs predicted displacement maps and predicted time series.
  • Figure 2: Qualitative comparison of 10-step-ahead forecasts for selected nodes. For each forecast graph, the x-axis represents the number of prediction steps, and the y-axis represents the displacement value. Our proposed MM-STT (blue solid line) demonstrates a superior ability to track the ground truth compared to all baseline models.
  • Figure 3: Integrated qualitative and quantitative comparison of predicted displacement maps. Each predicted map is annotated with its Structural Similarity Index Measure (SSIM) and Correlation score relative to the ground truth. The consistently perfect scores of our MM-STT model provide objective validation of its superior spatial fidelity.
  • Figure 4: In-depth statistical performance comparison. From left to right: (a) Scatter plot (true vs. predicted); (b) Residual plot; (c) Binned MAE; (d) Binned residuals boxplot. Each row corresponds to a different model. The unified y-axis scale highlights the transformative improvement of our MM-STT model.
  • Figure 5: Qualitative comparison of predicted displacement maps at the t+10 forecast horizon across six diverse regions. Each sub-figure compares the Ground Truth (left) against our MM-STT prediction (right).
  • ...and 1 more figure
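The data-processing half of Figure 1 (rasterization, temporal feature engineering, multi-channel fusion) can be sketched roughly as follows. This is an illustrative guess at the general shape of such a pipeline, not the authors' code: the function names, the cell-averaging rule, the sine/cosine annual encoding, the channel order, and the toy values are all assumptions.

```python
import numpy as np

def rasterize(rows, cols, values, shape):
    """Average point values that fall in each grid cell; empty cells become NaN."""
    total = np.zeros(shape)
    count = np.zeros(shape)
    np.add.at(total, (rows, cols), values)  # accumulate, handling repeated cells
    np.add.at(count, (rows, cols), 1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(count > 0, total / count, np.nan)

def temporal_cycle_features(day_of_year, period=365.25):
    """Encode acquisition dates as a point on the unit circle (assumed annual cycle)."""
    angle = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def build_data_cube(displacement, static_maps, day_of_year):
    """Stack dynamic, static and temporal channels into one cube.

    displacement: (T, H, W) rasterized displacement maps.
    static_maps:  (S, H, W) time-invariant priors, repeated at every step.
    day_of_year:  (T,) acquisition days.
    Returns an array of shape (T, 1 + S + 2, H, W).
    """
    T, H, W = displacement.shape
    sin_t, cos_t = temporal_cycle_features(day_of_year)
    dynamic = displacement[:, None]                                  # (T, 1, H, W)
    static = np.broadcast_to(static_maps, (T,) + static_maps.shape)  # (T, S, H, W)
    cycles = np.stack([sin_t, cos_t], axis=1)[:, :, None, None]      # (T, 2, 1, 1)
    cycles = np.broadcast_to(cycles, (T, 2, H, W))
    return np.concatenate([dynamic, static, cycles], axis=1)

# Toy example: three measurement points on a 2 x 3 grid, two sharing a cell.
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 2])
values = np.array([1.0, 3.0, 5.0])
grid = rasterize(rows, cols, values, shape=(2, 3))

T, H, W = 3, 2, 3
displacement = np.stack([np.nan_to_num(grid) * t for t in range(T)])
static_maps = np.ones((2, H, W))  # two placeholder static prior channels
cube = build_data_cube(displacement, static_maps, day_of_year=[0, 6, 12])
print(cube.shape)  # (3, 5, 2, 3)
```

Repeating the static and temporal channels at every time step keeps all modalities on the same `(T, channels, H, W)` grid, which is one simple way to let a single model see them together. How the paper handles empty cells and which priors it uses are not stated in this summary.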