Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models

Ehsan Zeraatkar, Salah A Faroughi, Jelena Tešić

TL;DR

This work tackles spectral bias in the downscaling (super-resolution) of Earth System Model outputs by introducing two frequency-aware SR architectures, ViSIR and ViFOR, that combine Vision Transformers (ViT) with frequency-sensitive representations. ViSIR extends ViT with sinusoidal activations in an implicit neural representation (INR) to improve high-frequency detail, while ViFOR adds explicit Fourier-based filtering to separate low- and high-frequency content and learn each independently. On the E3SM-HR dataset, ViSIR substantially outperforms baselines, and ViFOR achieves state-of-the-art PSNR and SSIM across multiple climate variables, particularly when trained on full-field images. The results underscore the importance of global context and explicit frequency decomposition for climate data downscaling, with potential extensions to spatio-temporal SR and physics-constrained learning.

Abstract

Super-resolution (SR) is crucial for enhancing the spatial fidelity of Earth System Model (ESM) outputs, allowing fine-scale structures vital to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer-based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce two frequency-aware frameworks: the Vision Transformer-Tuned Sinusoidal Implicit Representation (ViSIR), combining Vision Transformers and sinusoidal activations to mitigate spectral bias, and the Vision Transformer Fourier Representation Network (ViFOR), which integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in PSNR and significantly higher SSIM. Detailed ablation and scaling studies highlight the benefit of full-field training, the impact of frequency hyperparameters, and the potential for generalization. The results establish ViFOR as a state-of-the-art, scalable solution for climate data downscaling. Future extensions will address temporal super-resolution, multimodal climate variables, automated parameter selection, and integration of physical conservation constraints to broaden scientific applicability.
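
The explicit low/high frequency separation described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' ViFOR implementation: the hard radial mask, the function names `split_frequencies` and `psnr`, the cutoff expressed in cycles per pixel, and the synthetic test field are all assumptions made for illustration. The sketch only shows how one field can be split into two complementary bands that a model could then learn separately, and how PSNR might score a reconstruction.

```python
import numpy as np


def split_frequencies(field, cutoff=0.3):
    """Split a 2-D field into low- and high-frequency parts.

    Assumption: `cutoff` is a radial frequency in cycles per pixel and the
    filter is a hard (ideal) mask; the paper may use a different filter.
    """
    fy = np.fft.fftfreq(field.shape[0])  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(field.shape[1])  # horizontal frequencies
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(field)
    # The mask is symmetric in +/- frequency, so both bands are real-valued
    # up to round-off; .real drops the negligible imaginary part.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB, using the reference's value range
    as the peak (an assumed convention)."""
    mse = np.mean((reference - estimate) ** 2)
    data_range = reference.max() - reference.min()
    return 10.0 * np.log10(data_range ** 2 / mse)


# Synthetic field: a smooth wave plus a weaker fine-scale wave.
x = np.arange(64)
smooth = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))        # 0.031 cycles/pixel
fine = 0.2 * np.tile(np.sin(2 * np.pi * 24 * x / 64), (64, 1))   # 0.375 cycles/pixel
field = smooth + fine

low, high = split_frequencies(field, cutoff=0.3)
score_without_detail = psnr(field, low)  # cost of dropping the high band
```

Because the two masks are complementary, `low + high` reproduces the input field, so no information is lost by the split itself. A reconstruction that recovers only the low band scores a finite PSNR against the full field, which is the kind of gap that spectral bias produces.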

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: A global-scale HR image yields a very LR output when limited to country scale.
  • Figure 2: ViSIR divides the input image into patches, pre-processes them using embedding and position encoding, and feeds the input to a visual transformer followed by the SIREN architecture.
  • Figure 3: ViFOR pipeline: SIREN is replaced by the Fourier-based activation function network in the transformer and output sections.
  • Figure 4: Panels (a), (b), and (c) show surface temperature, shortwave heat flux, and longwave heat flux, respectively, for the first month of year one obtained from the global fine-resolution configuration of E3SM.
  • Figure 5: PSNR across different cutoff frequencies $f_c$ for ViFOR. Optimal performance was achieved at $f_c=0.3$ Hz.
  • ...and 1 more figure
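
The front and back ends of the pipeline in the Figure 2 caption (patch division, then a sinusoidal layer) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the ViSIR code: the embedding, position encoding, and transformer stages are omitted entirely, the names `patchify` and `sine_layer` are hypothetical, and the frequency scale `omega_0 = 30` and the uniform first-layer initialization follow the original SIREN formulation rather than anything stated in this summary.

```python
import numpy as np


def patchify(image, patch):
    """Cut a 2-D field into non-overlapping patch x patch tiles and
    flatten each tile into one row (one token per tile)."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "field must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def sine_layer(x, weight, bias, omega_0=30.0):
    """SIREN-style layer: a linear map followed by a sine activation.

    Assumption: omega_0 = 30 is the value used in the original SIREN
    work; the paper summarized here may tune it differently.
    """
    return np.sin(omega_0 * (x @ weight + bias))


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 64))   # stand-in for one climate field
tokens = patchify(image, patch=8)       # 4 x 8 tiles, 64 values each

n_in, n_hidden = tokens.shape[1], 16    # hidden width is illustrative
# First-layer SIREN-style initialization: uniform in [-1/n_in, 1/n_in].
weight = rng.uniform(-1.0 / n_in, 1.0 / n_in, size=(n_in, n_hidden))
bias = np.zeros(n_hidden)
features = sine_layer(tokens, weight, bias)
```

The sine activation keeps every output in [-1, 1] while letting the layer represent rapidly varying signals, which is the property the summary credits for recovering high-frequency detail.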