Deep Learning for Spatio-Temporal Fusion in Land Surface Temperature Estimation: A Comprehensive Survey, Experimental Analysis, and Future Trends

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

TL;DR

This survey formalizes the spatio-temporal fusion (STF) task for Land Surface Temperature (LST) and reviews deep learning (DL) approaches built on CNNs, autoencoders, GANs, and transformers, highlighting how LST’s rapid temporal dynamics and sharp spatial gradients challenge STF methods designed for surface reflectance (SR). It introduces STF-LST, a public MODIS-Landsat LST dataset (51 pairs, 2013–2024), to benchmark methods and expose gaps in current DL architectures, including generalization, cloud-gap handling, and physical consistency. The study finds that DL methods originally designed for SR struggle to generalize to LST: average RMSE often exceeds 3°C, and outputs show notable artifacts or oversmoothing. These results underscore the need for joint spatio-temporal models, robust gap handling, and physics-informed losses. The work outlines future directions, including unified spatio-temporal networks, pretrained fusion models, higher-resolution guidance, and potential LLM-assisted semantic augmentation, to make LST fusion practically reliable for climate and urban applications.

Abstract

Land Surface Temperature (LST) plays a key role in climate monitoring, urban heat assessment, and land-atmosphere interactions. However, current thermal infrared satellite sensors cannot simultaneously achieve high spatial and temporal resolution. Spatio-temporal fusion (STF) techniques address this limitation by combining complementary satellite data, one with high spatial but low temporal resolution, and another with high temporal but low spatial resolution. Existing STF techniques, from classical models to modern deep learning (DL) architectures, were primarily developed for surface reflectance (SR). Their application to thermal data remains limited and often overlooks LST-specific spatial and temporal variability. This study provides a focused review of DL-based STF methods for LST. We present a formal mathematical definition of the thermal fusion task, propose a refined taxonomy of relevant DL methods, and analyze the modifications required when adapting SR-oriented models to LST. To support reproducibility and benchmarking, we introduce a new dataset comprising 51 Terra MODIS-Landsat LST pairs from 2013 to 2024, and evaluate representative models to explore their behavior on thermal data. The analysis highlights performance gaps, architecture sensitivities, and open research challenges. The dataset and accompanying resources are publicly available at https://github.com/Sofianebouaziz1/STF-LST.
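The summary above reports errors as RMSE, so a minimal sketch of how such a score could be computed may help when reading the benchmark results. It assumes that cloud-covered pixels are marked as NaN and excluded from the average; that masking convention, the function name `masked_rmse`, and the sample values are illustrative assumptions, not the paper's evaluation protocol.

```python
import math


def masked_rmse(predicted, reference):
    """RMSE over pixels where both values are finite.

    Assumption: NaN marks a cloud-masked (missing) pixel in either image.
    """
    squared_errors = [
        (p - r) ** 2
        for p, r in zip(predicted, reference)
        if math.isfinite(p) and math.isfinite(r)
    ]
    if not squared_errors:
        raise ValueError("no valid pixel pairs to compare")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical flattened LST images in kelvin; a temperature difference
# has the same numeric value in kelvin and in degrees Celsius.
predicted = [300.0, 302.0, float("nan"), 298.0]
reference = [303.0, 298.0, 301.0, float("nan")]

# Only the first two pixel pairs are valid: errors of -3 and +4.
rmse = masked_rmse(predicted, reference)
print(f"RMSE over valid pixels: {rmse:.3f}")
```

Excluding masked pixels keeps gaps from distorting the score, but it also means that two methods evaluated on different valid-pixel sets are not directly comparable.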

Paper Structure

This paper contains 40 sections, 14 equations, 12 figures, 13 tables.

Figures (12)

  • Figure S1: Yearly literature count related to STF for LST estimation indexed by Google Scholar since 2015. The search query also covered the synonyms of STF, including data and image fusion.
  • Figure S2: Satellite-derived LST, inspired by li2023satellite. $T_{s i}$, $\varepsilon_i$, and $a_i$ represent the surface temperature, emissivity, and projected area weight for the $i$-th visible component, respectively. $\theta_v$ is the view zenith angle, and $\varphi_v$ is the viewing azimuth angle.
  • Figure S3: Graphic representation of the STF for LST estimation. $X_1$ denotes data from the MODIS Terra satellite, and $X_2$ refers to data from the Landsat 8 satellite. The ROI, $s$, corresponds to Orléans Métropole, in France. The time steps $t_1$, $t_2$, and $t_3$ represent 6 Mar, 22 Mar, and 9 May 2022, respectively.
  • Figure S4: Proposed Taxonomy for DL-Based STF Methods based on Architecture, Learning Paradigm, Training Strategy, and Incorporation of Pre-trained Models. Methods originally developed for related tasks (e.g., SR, NDVI) are included to highlight transferable design patterns relevant to LST estimation.
  • Figure S5: Typical architecture of a CNN-based STF method using a single pair of images, $P_1$, composed of four main blocks: spatial feature extraction, temporal variation extraction, fusion of spatial and temporal representations, and satellite image reconstruction. The fusion rule can be element-wise addition, multiplication, concatenation, attention, etc.
  • ...and 7 more figures
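
The Figure S5 caption names several possible fusion rules for combining spatial and temporal representations. The sketch below illustrates three of them (element-wise addition, element-wise multiplication, and channel concatenation) on toy feature maps stored as nested lists of shape [channel][pixel]. The function name `fuse`, the data layout, and the sample values are assumptions made for illustration; attention-based fusion is omitted because it requires learned weights, and none of this reproduces a specific method from the survey.

```python
def fuse(spatial, temporal, rule="add"):
    """Combine two feature maps given as [channel][pixel] nested lists."""
    if rule == "concat":
        # Concatenation stacks the channels of both inputs, so the
        # output has len(spatial) + len(temporal) channels.
        return [list(ch) for ch in spatial] + [list(ch) for ch in temporal]

    # Element-wise rules require identical shapes.
    same_shape = len(spatial) == len(temporal) and all(
        len(a) == len(b) for a, b in zip(spatial, temporal)
    )
    if not same_shape:
        raise ValueError("element-wise fusion needs matching shapes")

    if rule == "add":
        return [[a + b for a, b in zip(cs, ct)]
                for cs, ct in zip(spatial, temporal)]
    if rule == "mul":
        return [[a * b for a, b in zip(cs, ct)]
                for cs, ct in zip(spatial, temporal)]
    raise ValueError(f"unknown fusion rule: {rule}")


# Toy single-channel features with two pixels each (hypothetical values).
spatial_features = [[1.0, 2.0]]
temporal_features = [[3.0, 4.0]]

added = fuse(spatial_features, temporal_features, "add")
multiplied = fuse(spatial_features, temporal_features, "mul")
stacked = fuse(spatial_features, temporal_features, "concat")
```

Addition and multiplication keep the channel count fixed, while concatenation doubles it and leaves the mixing to the layers that follow, which is one reason the choice of rule affects the size of the reconstruction block.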