UrbanDIFF: A Denoising Diffusion Model for Spatial Gap Filling of Urban Land Surface Temperature Under Dense Cloud Cover

Arya Chavoshi, Hassan Dashtian, Naveen Sudharsan, Dev Niyogi

TL;DR

Cloud contamination in land surface temperature (LST) imagery hinders continuous surface urban heat island (SUHI) monitoring. UrbanDIFF is a purely spatial, diffusion-based gap-filling method conditioned on static urban structure and elevation. It uses a DDPM with pixel-guided refinement and RePaint-style projection to enforce consistency with revealed pixels, and is trained on MODIS Terra LST across seven US metropolitan areas (2002–2025). In synthetic cloud tests, UrbanDIFF outperforms an interpolation baseline, especially under dense occlusion, with robust SUHI estimation and cross-city consistency. It is not a full operational replacement for spatiotemporal or multi-sensor methods, but it provides a methodological foundation for purely spatial LST reconstruction and for future extensions that incorporate temporal context and uncertainty handling.

Abstract

Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep-learning-based spatial models deteriorate rapidly with increasing missingness. Recent advances in denoising-diffusion-based image inpainting models have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud-contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with revealed cloud-free pixels through a supervised pixel-guided refinement step during inference. UrbanDIFF is trained and evaluated using NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving an SSIM of 0.89, an RMSE of 1.2 K, and an $R^2$ of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
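
The consistency mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the squared-error loss, and the step size are assumptions made for illustration, and the trained denoiser and conditioning inputs are left out entirely. The sketch shows only the two operations applied to revealed pixels: a gradient step that pulls the current estimate toward the observations, and a hard projection that overwrites revealed pixels with their observed values.

```python
import numpy as np

def refine_on_revealed(x, observed, revealed, step_size=0.5, n_steps=1):
    """Gradient descent on 0.5 * sum over revealed pixels of (x - observed)^2.

    The loss form and step size are illustrative assumptions.
    """
    x = x.astype(float).copy()
    for _ in range(n_steps):
        # The gradient is the residual on revealed pixels and zero under cloud.
        grad = np.where(revealed, x - observed, 0.0)
        x -= step_size * grad
    return x

def project_revealed(x, observed, revealed):
    """Hard consistency: copy observed values into revealed pixels."""
    return np.where(revealed, observed, x)

# Toy 4x4 scene in kelvin: top half revealed, bottom half under synthetic cloud.
observed = np.full((4, 4), 300.0)
revealed = np.zeros((4, 4), dtype=bool)
revealed[:2, :] = True

estimate = np.full((4, 4), 296.0)  # stand-in for an intermediate model output
refined = refine_on_revealed(estimate, observed, revealed, n_steps=2)
projected = project_revealed(refined, observed, revealed)
```

In this toy setup each gradient step halves the residual on revealed pixels, while pixels under the mask are left for the generative model to fill. In an actual diffusion sampler these operations would be interleaved with denoising steps.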

Paper Structure

This paper contains 28 sections, 22 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Methodological overview of UrbanDIFF. (a) Schematic of the overall inference pipeline, illustrating the iterative denoising process combined with pixel-level supervised refinement across diffusion timesteps. At each timestep, denoising steps are interleaved with gradient-based updates that enforce consistency with revealed (cloud-free) pixels. (b) Details of the supervised pixel-guided refinement step, including the loss function defined on revealed pixels and the corresponding gradient update applied before each denoising iteration. (c) Conditional denoising step of the diffusion process, showing the probabilistic formulation of the reverse diffusion transition.
  • Figure 2: Overview of the datasets used in this study, showing MODIS LST for selected urban regions and the corresponding static conditioning variables (built-up surface and DEM).
  • Figure 3: Effect of hyperparameter variations on UrbanDIFF's normalized performance score. The normalized score is computed as a weighted average of normalized SSIM, RMSE, $R^2$, SUHI error, and per-image inference time. (a) Heatmap of the normalized score across synthetic cloud conditions, defined by cloud coverage (cc) and cloud octaves, as a function of the number of denoising timesteps ($T$) and the guidance stride ($\tau$). (b) Influence of the total number of supervised gradient steps ($T/\tau$) on the normalized score for three representative cloud coverage levels.
  • Figure 4: Influence of synthetic cloud coverage on the gap-filling performance of UrbanDIFF and the baseline interpolation model, evaluated using SSIM, RMSE, $R^2$, and SUHI error.
  • Figure 5: Influence of synthetic cloud mask octaves (mask density) on the gap-filling performance of UrbanDIFF and the baseline interpolation model, evaluated using SSIM, RMSE, $R^2$, and SUHI error.
  • ...and 2 more figures
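
Two of the metrics named in the captions above, RMSE and $R^2$, can be computed with a few lines of NumPy. The sketch below assumes that scores are taken only over pixels hidden by the synthetic cloud mask; this excerpt does not state whether the paper evaluates that way, and the function names are hypothetical. SSIM is omitted because it requires a windowed implementation.

```python
import numpy as np

def masked_rmse(truth, pred, hidden):
    """Root-mean-square error over pixels hidden by the synthetic mask."""
    diff = pred[hidden] - truth[hidden]
    return float(np.sqrt(np.mean(diff ** 2)))

def masked_r2(truth, pred, hidden):
    """Coefficient of determination over hidden pixels."""
    t, p = truth[hidden], pred[hidden]
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: the first row is hidden and the prediction is biased by +1 K.
truth = np.arange(16.0).reshape(4, 4)
hidden = np.zeros((4, 4), dtype=bool)
hidden[0, :] = True
pred = np.where(hidden, truth + 1.0, truth)

rmse_value = masked_rmse(truth, pred, hidden)
r2_value = masked_r2(truth, pred, hidden)
```

Restricting the score to hidden pixels avoids inflating it with revealed pixels, which a method that enforces strict consistency reproduces exactly.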