Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang

TL;DR

This paper proposes ScaleFormer, a novel architecture for multi-scale pansharpening that incorporates Rotary Positional Encoding to improve extrapolation to unseen scales and outperforms SOTA methods in fusion quality and cross-scale generalization.

Abstract

Pansharpening aims to generate high-resolution multi-spectral (MS) images by fusing the spatial detail of panchromatic (PAN) images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated only in restricted, low-resolution settings, which limits their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences in which every patch has the same resolution while the sequence length varies in proportion to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code will be made available upon acceptance.
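The sequence-length view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `patchify`, `rope`, and `dot`, the patch size of 16, and the base of 10000 are assumptions chosen for illustration, and the Scale-Aware Patchify module itself is not reproduced here. The sketch shows two things: a larger image yields more patches of the same size, and a standard rotary encoding makes the query-key score depend only on the relative offset between positions, which is the property usually credited for extrapolation to longer sequences.

```python
import math

def patchify(image, patch):
    """Split a 2-D image (list of rows) into a row-major sequence of patch x patch tiles."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to the position pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per coordinate pair
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Same patch resolution, sequence length grows with image scale.
seq_small = patchify([[0.0] * 64 for _ in range(64)], 16)
seq_large = patchify([[0.0] * 128 for _ in range(128)], 16)
print(len(seq_small), len(seq_large))

# Hypothetical query/key vectors: the score depends on the offset (3), not on absolute position.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 503), rope(k, 500))
print(abs(near - far) < 1e-9)
```

Doubling the image side quadruples the number of patches while each patch stays 16 x 16, so a model that operates on patch sequences sees a longer sequence rather than a differently sized input.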

Paper Structure (27 sections, 8 equations, 10 figures, 15 tables)

Figures (10)

  • Figure 1: Challenges in cross-scale pansharpening. (a) Learning/modeling at limited resolution while supporting inference on high-resolution images. (b) GPU memory increases rapidly with resolution, especially for Transformer-based models. (c) Tiled inference can introduce block artifacts in some methods. (d) A scale-induced distribution shift exists between training-scale inputs and large-scale inference.
  • Figure 2: Comparison of convolution, Transformer, and the proposed ScaleFormer. (a) Convolution: linear in image size but with a limited receptive field on high-resolution inputs; enlarging kernels helps but adds quadratic cost in kernel size. (b) Transformer: uniform patchification enables global context, yet self-attention scales quadratically with patch count, driving computation up on large images. (c) ScaleFormer (ours): introduces a sequence axis, factorizing global modeling into fixed-size spatial processing plus a scalable sequence dimension, yielding controllable complexity and mitigating scale-induced distribution shifts for robust performance across resolutions.
  • Figure 3: PanScale dataset: composition and distribution, resolution range, geographic sampling locations, and representative scene visualizations.
  • Figure 4: The overall architecture of ScaleFormer, which primarily consists of three key components: the Scale-Aware Patchify module implementing the bucket sampling strategy, the Single Transformer Modules for intra-modal feature modeling, and the Cross Transformer Modules for inter-modal feature interaction and fusion.
  • Figure 5: Comparison of results across increasing resolution scales. The four subplots show: (a) PSNR on the Landsat dataset, (b) PSNR on the Skysat dataset, (c) computational complexity (in GFLOPs) as a function of input resolution, and (d) memory usage as a function of input resolution.
  • ...and 5 more figures
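
The scaling behaviour described in the Figure 2 caption can be made concrete with a rough count. This is a back-of-the-envelope sketch, not a measurement from the paper: the function names, the single-channel stride-1 convolution, and the patch size of 16 are assumptions for illustration, and ScaleFormer's own cost is deliberately not modeled because the captions do not specify it.

```python
def conv_macs(height, width, kernel):
    """Multiply-accumulates for one single-channel k x k convolution at stride 1."""
    return height * width * kernel * kernel

def attention_pairs(height, width, patch):
    """Token pairs scored by global self-attention over non-overlapping patches."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens

for side in (256, 512, 1024):
    print(side, conv_macs(side, side, 3), attention_pairs(side, side, 16))

# Growth factors when the image side doubles, and when the kernel grows from 3 to 7.
conv_growth = conv_macs(512, 512, 3) / conv_macs(256, 256, 3)
attn_growth = attention_pairs(512, 512, 16) / attention_pairs(256, 256, 16)
kernel_growth = conv_macs(256, 256, 7) / conv_macs(256, 256, 3)
```

Under these assumptions, doubling the image side multiplies the convolution cost by 4 (linear in pixel count) and the attention pair count by 16 (quadratic in patch count), while enlarging the kernel from 3 to 7 multiplies the convolution cost by 49/9.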