ViSIR: Vision Transformer Single Image Reconstruction Method for Earth System Models

Ehsan Zeraatkar, Salah Faroughi, Jelena Tešić

TL;DR

This work introduces ViSIR, a hybrid Vision Transformer–SIREN framework for single-image super-resolution of Earth System Model outputs. By embedding a SIREN-based, frequency-tuned implicit representation into the ViT final layer, ViSIR effectively mitigates spectral bias and preserves high-frequency details in SR tasks. Across an E3SM-derived benchmark dataset, ViSIR achieves substantial improvements in PSNR, SSIM, and MSE over ViT, SIREN, SRCNN, and SRGAN baselines, including notable gains of over 10 dB PSNR relative to SIREN and strong performance in corner cases. The approach promises enhanced high-resolution climate imagery fidelity, with future work aimed at efficiency, multi-image/video extension, and uncertainty quantification to support practical deployment in climate modeling and decision-making.

Abstract

Purpose: Earth system models (ESMs) integrate the interactions of the atmosphere, ocean, land, ice, and biosphere to estimate the state of regional and global climate under a wide variety of conditions. ESMs are highly complex; thus, deep neural network architectures are used to model the complexity and store the down-sampled data. This paper proposes the Vision Transformer Sinusoidal Representation Network (ViSIR) to improve single-image super-resolution (SR) reconstruction of ESM data. Methods: ViSIR combines the SR capability of Vision Transformers (ViT) with the high-frequency detail preservation of the Sinusoidal Representation Network (SIREN) to address the spectral bias observed in SR tasks. Results: ViSIR outperforms SRCNN by 2.16 dB, ViT by 6.29 dB, SIREN by 8.34 dB, and SR Generative Adversarial Networks (SRGANs) by 7.93 dB in PSNR, on average across three different measurements. Conclusion: ViSIR is evaluated against state-of-the-art methods, and the results show that it outperforms them in terms of Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM).
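
The results above are reported in MSE and PSNR (SSIM is not covered here). As a rough illustration of how those two metrics relate, the following is a minimal sketch using the standard definitions, not the authors' evaluation code. The function names and the `max_value=1.0` peak intensity (images scaled to [0, 1]) are assumptions made for this example.

```python
import math


def mse(reference, reconstructed):
    """Mean squared error between two equally sized 2D images (lists of rows)."""
    total, count = 0.0, 0
    for ref_row, rec_row in zip(reference, reconstructed):
        for ref_px, rec_px in zip(ref_row, rec_row):
            total += (ref_px - rec_px) ** 2
            count += 1
    return total / count


def psnr(reference, reconstructed, max_value=1.0):
    """PSNR in dB; max_value is the assumed peak pixel intensity."""
    err = mse(reference, reconstructed)
    if err == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical 2x2 example: two pixels are off by 0.1 each.
reference = [[0.0, 1.0], [0.5, 0.5]]
reconstructed = [[0.1, 0.9], [0.5, 0.5]]
example_mse = mse(reference, reconstructed)
example_psnr = psnr(reference, reconstructed)
```

Because PSNR is logarithmic in MSE, a gain of 10 dB on a single image corresponds to a tenfold reduction in that image's MSE.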

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: ViSIR divides the input image into patches, pre-processes them using embedding and position encoding, and feeds the input to a visual transformer followed by the SIREN architecture.
  • Figure 2: 2D (left) and 3D (right) illustrations of the PSNR values for different frequencies and numbers of hidden layers in the proposed ViSIR, applied to 180 images of the Surface Temperature variable.
  • Figure 3: Panels (a), (b), and (c) show surface temperature, shortwave heat flux, and longwave heat flux, respectively, for the first month of year one obtained from the global fine-resolution configuration of E3SM [Nikhil2024].
  • Figure 4: Max, Mean, and Min PSNR, MSE, and SSIM values over Surface Temperature measurements.
  • Figure 5: Image reconstruction from low-resolution Surface Temperature image using ViSIR.
  • ...and 1 more figures
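
The Figure 1 caption describes a pipeline that splits the image into patches and ends in a SIREN stage. The sketch below illustrates only those two ends of the pipeline: non-overlapping patch extraction and a single sine-activated layer of the form sin(ω₀(Wx + b)). The embedding, position encoding, and transformer encoder are omitted. The function names, the patch size, and the default `omega_0=30.0` are assumptions for illustration, not the paper's configuration (Figure 2 indicates the frequency is tuned).

```python
import math


def split_into_patches(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            patches.append(tile)
    return patches


def sine_layer(x, weights, bias, omega_0=30.0):
    """SIREN-style layer: sin(omega_0 * (W x + b)), one output per weight row."""
    outputs = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * v for w, v in zip(row, x)) + b
        outputs.append(math.sin(omega_0 * pre_activation))
    return outputs


# Hypothetical 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)

# One sine unit with hand-picked weights; omega_0=1.0 keeps the value easy to check.
activation = sine_layer([1.0, 2.0], weights=[[0.5, 0.25]], bias=[0.0], omega_0=1.0)
```

A larger `omega_0` makes the layer respond to finer spatial variation, which is the property a SIREN stage relies on to counter spectral bias.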