
MoCoLSK: Modality Conditioned High-Resolution Downscaling for Land Surface Temperature

Qun Dai, Chunyang Yuan, Yimian Dai, Yuxuan Li, Xiang Li, Kang Ni, Jianhui Xu, Xiangbo Shu, Jian Yang

TL;DR

This work tackles high-resolution land surface temperature (LST) downscaling by addressing two gaps: spatial non-stationarity and the lack of open-source tooling. It introduces MoCoLSK-Net, a modality-conditioned dynamic fusion network that combines a Large Selective Kernel pathway with a modality-conditioned weight generator to adapt receptive fields to multi-modal guidance data. The authors also launch GrokLST, an open-source ecosystem comprising a 30 m high-resolution (HR) LST benchmark (the GrokLST dataset) and a PyTorch toolkit with 40+ baselines to standardize evaluation. On the GrokLST dataset, MoCoLSK-Net achieves state-of-the-art performance across multiple downscaling factors, which the authors attribute to its ability to capture complex, texture-rich relationships in multispectral data. Overall, the work provides a practical, reproducible pathway for high-resolution LST retrieval, with implications for environmental monitoring and climate studies.

Abstract

Land Surface Temperature (LST) is a critical parameter for environmental studies, but directly obtaining high spatial resolution LST data remains challenging due to the spatio-temporal trade-off in satellite remote sensing. Guided LST downscaling has emerged as an alternative solution to overcome these limitations, but current methods often neglect spatial non-stationarity, and there is a lack of an open-source ecosystem for deep learning methods. In this paper, we propose the Modality-Conditional Large Selective Kernel (MoCoLSK) Network, a novel architecture that dynamically fuses multi-modal data through modality-conditioned projections. MoCoLSK achieves a confluence of dynamic receptive field adjustment and multi-modal feature fusion, leading to enhanced LST prediction accuracy. Furthermore, we establish the GrokLST project, a comprehensive open-source ecosystem featuring the GrokLST dataset, a high-resolution benchmark, and the GrokLST toolkit, an open-source PyTorch-based toolkit encapsulating MoCoLSK alongside 40+ state-of-the-art approaches. Extensive experimental results validate MoCoLSK's effectiveness in capturing complex dependencies and subtle variations within multispectral data, outperforming existing methods in LST downscaling. Our code, dataset, and toolkit are available at https://github.com/GrokCV/GrokLST.
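The idea of "dynamic receptive field adjustment" driven by guidance data can be illustrated with a toy example. The sketch below is not the MoCoLSK module: it assumes only two feature maps (one from a small kernel, one from a large kernel) and a hypothetical per-pixel conditioning value `cond_logits` standing in for weights derived from guidance data. The names `conditioned_blend` and `cond_logits` are illustrative and do not come from the paper or the GrokLST toolkit.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def conditioned_blend(small_rf, large_rf, cond_logits):
    """Blend two same-sized 2-D feature maps pixel by pixel.

    small_rf:    responses computed with a small receptive field
    large_rf:    responses computed with a large receptive field
    cond_logits: hypothetical per-pixel conditioning values (assumed to be
                 derived from guidance data); larger values favor large_rf
    """
    fused = []
    for row_s, row_l, row_c in zip(small_rf, large_rf, cond_logits):
        fused_row = []
        for s, l, c in zip(row_s, row_l, row_c):
            mask = sigmoid(c)  # spatial selection weight in (0, 1)
            fused_row.append(mask * l + (1.0 - mask) * s)
        fused.append(fused_row)
    return fused


small = [[1.0, 2.0]]
large = [[3.0, 6.0]]
# Zero conditioning gives an even mix of both receptive fields.
print(conditioned_blend(small, large, [[0.0, 0.0]]))  # [[2.0, 4.0]]
```

Because the mask varies per pixel, different locations can rely on different receptive fields, which is one simple way to picture a response to spatial non-stationarity. The complementary two-way mask used here is a simplification chosen for brevity.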

Paper Structure

This paper contains 33 sections, 19 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Selection area and representative images of our GrokLST dataset. At a spatial resolution of 30 m or finer, numerous local details emerge, which underscores the need for models with a dynamic receptive field capable of capturing these intricate patterns.
  • Figure 2: The overall framework of MoCoLSK-Net, which primarily includes an LST branch, a guidance branch, the MoCoLSK module, and a reconstruction module. MoCoLSK-Net stacks $N$ stages, each comprising two residual groups and one MoCoLSK module. The output of the $N$-th stage is fed into the reconstruction module. The downscaled HR LST is obtained by adding the output of the reconstruction module to the bicubic upsampling of the original LR LST data.
  • Figure 3: Overview of our proposed MoCoLSK module. The module primarily consists of the large selective kernel pathway and the modality-conditioned weight generation (MCWG) pathway. The LST pathway essentially follows the configuration of the original LSK module [ICCV2023LSK], with two key differences: 1) the generation of the spatial selection mask $\bm{\widetilde{SA}}$ is modulated by the modality-conditioned weights from the MCWG pathway; 2) the output feature $\bm{Z}$ is the result of fusing two modality features. In the MCWG pathway, the HR LST features and HR guidance features are deeply fused using the pyramid pooling module and the dynamic MLP [CVPR2022DMLP] to generate modality-conditioned weights. Additionally, to facilitate modality fusion and the stacking of multiple MoCoLSK modules for more refined LST feature reconstruction, we use up-projection units to upsample the LR LST features and down-projection units to downsample the concatenation of the modality fusion output $\bm{Z}$ and the HR LST features $\bm{X}$.
  • Figure 4: Gallery of the GrokLST dataset, showing LR (240 m) and HR (30 m) LST images side by side, along with a suite of 10 types of auxiliary data.
  • Figure 5: Visual comparison of $\times$8 reconstructed images from different methods. Each method is represented by two images: the left image shows the reconstruction result, and the right image shows the difference between the reconstruction result and the ground truth (GT). A temperature color bar in Kelvin (K) is shown on the far right: pixels with values larger than the GT are displayed in red, pixels with smaller values in blue, and pixels with identical values in white.
  • ...and 5 more figures
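
Two steps described in the captions above lend themselves to a short worked sketch: the global residual reconstruction of Figure 2 (upsampled LR LST plus the reconstruction output) and the signed difference map of Figure 5. The code below is a minimal illustration under stated assumptions, not the authors' implementation. Nearest-neighbor upsampling is swapped in for bicubic interpolation to keep the example dependency-free, and all function names (`upsample_nearest`, `add_residual`, `signed_error`, `error_color`) are hypothetical.

```python
# Scale factor implied by the LR (240 m) and HR (30 m) grids in Figure 4.
SCALE = 240 // 30  # 8, matching the x8 setting of Figure 5


def upsample_nearest(lr, factor):
    """Repeat each LR pixel factor x factor times.

    Stand-in for the bicubic upsampling named in Figure 2.
    """
    hr = []
    for row in lr:
        wide = [v for v in row for _ in range(factor)]
        hr.extend(list(wide) for _ in range(factor))
    return hr


def add_residual(base, residual):
    """Global residual connection: prediction = upsampled LR + residual."""
    return [[b + r for b, r in zip(rb, rr)] for rb, rr in zip(base, residual)]


def signed_error(pred, gt):
    """Per-pixel difference (prediction minus ground truth), in Kelvin."""
    return [[p - g for p, g in zip(rp, rg)] for rp, rg in zip(pred, gt)]


def error_color(diff):
    """Color convention of Figure 5: red above GT, blue below, white equal."""
    if diff > 0:
        return "red"
    if diff < 0:
        return "blue"
    return "white"


lr_lst = [[300.0, 302.0]]                 # toy LR LST patch in Kelvin
base = upsample_nearest(lr_lst, 2)        # factor 2 keeps the toy example small
residual = [[0.5, 0.0, -0.5, 0.0],        # placeholder for the network output
            [0.0, 0.0, 0.0, 0.0]]
pred = add_residual(base, residual)
diff = signed_error(pred, base)           # the upsampled patch doubles as a toy GT
print([error_color(d) for d in diff[0]])  # ['red', 'white', 'blue', 'white']
```

With this formulation the network only has to predict the high-frequency correction on top of the interpolated LR field, which is the role the Figure 2 caption assigns to the reconstruction module.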