Training and Inference within 1 Second -- Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring

Tianyu Xin, Jin-Liang Xiao, Zeyu Xia, Shan Yin, Liang-Jian Deng

TL;DR

This work tackles cross-sensor degradation in real-world pansharpening by introducing Efficient Residual Feature Tailoring (ERFT), a plug-and-play framework that inserts a lightweight Feature Tailor between a frozen Feature Extractor and Channel Mapper to adapt fused features at the feature level. It employs a patch-wise training/inference scheme and physics-aware unsupervised losses (spectral, spatial, and consistency) to achieve cross-sensor generalization without extra training data, while preserving the capabilities of the pretrained backbone. Empirical results on four real-world sensor datasets demonstrate state-of-the-art fusion quality, with training and inference completing in under 0.2 seconds for a $512\times512\times8$ image and under 3 seconds for a $4000\times4000\times8$ image on an RTX 3090 GPU, over 100 times faster than zero-shot methods. Overall, ERFT offers a scalable, data-efficient, and practical solution for real-world cross-sensor pansharpening.

Abstract

Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing approaches to such cross-sensor degradation include retraining the model and zero-shot methods, but they are highly time-consuming or even require extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) $\textit{Improved Generalization Ability}$: it significantly enhances performance in cross-sensor cases. (2) $\textit{Low Generalization Cost}$: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, at the fastest setting on a commonly used RTX 3090 GPU, training and inference take under $\textit{0.2 seconds}$ for a $512\times512\times8$ image and under $\textit{3 seconds}$ for a $4000\times4000\times8$ image, which is over 100 times faster than zero-shot methods.
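The interface described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the two-layer 1x1-convolution design of `FeatureTailor`, the zero initialization of its last layer, and the random stand-ins for the frozen Channel Mapper and the fused features are all assumptions made for illustration. The point it shows is only that a residual module placed before channel mapping can leave the pretrained output unchanged until it is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature channels, output bands, patch height and width.
C, BANDS, H, W = 32, 8, 16, 16


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to x of shape (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


# Stand-in for the frozen, pretrained Channel Mapper (assumed to be a 1x1 conv).
cm_w = rng.standard_normal((BANDS, C)) * 0.1
cm_b = np.zeros(BANDS)


def channel_mapper(z):
    """Map fused features (C, H, W) to the channel space of the image (BANDS, H, W)."""
    return conv1x1(z, cm_w, cm_b)


class FeatureTailor:
    """Hypothetical residual adapter: Z* = Z + g(Z), with g a small two-layer network."""

    def __init__(self, channels, hidden=16):
        self.w1 = rng.standard_normal((hidden, channels)) * 0.1
        self.b1 = np.zeros(hidden)
        # Assumption: zero-initialize the last layer so that Z* == Z before training.
        self.w2 = np.zeros((channels, hidden))
        self.b2 = np.zeros(channels)

    def __call__(self, z):
        h = np.maximum(conv1x1(z, self.w1, self.b1), 0.0)  # ReLU
        return z + conv1x1(h, self.w2, self.b2)            # residual connection


# Stand-in for the fused features produced by the frozen Feature Extractor.
z = rng.standard_normal((C, H, W))

ft = FeatureTailor(C)
x_hat = channel_mapper(z)           # original output, FT bypassed
x_hat_star = channel_mapper(ft(z))  # tailored output, FT inserted before channel mapping
```

Under these assumptions `x_hat_star` equals `x_hat` at initialization, so only the small set of Feature Tailor parameters would need to be updated at test time while the backbone stays frozen.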

Paper Structure

This paper contains 49 sections, 26 equations, 12 figures, 11 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of different cross-sensor pansharpening methods. Existing approaches (upper panel) struggle to balance efficiency, quality, and data requirements. In contrast, our method (lower panel) achieves state-of-the-art performance with sub-second efficiency, requires only test-time inputs, and fully leverages pretrained model capabilities.
  • Figure 2: Our Efficient Residual Feature Tailoring pipeline is conducted in a patch-wise manner. Specifically, (1) randomly selected patches are used for unsupervised training of the Feature Tailor, enabling feature-level adjustments; (2) parallel inference is conducted on all split patches, and the resulting HRMS patches are stitched together to form the final HRMS image.
  • Figure 3: HQNR comparison for FusionNet and U2Net on WV3 dataset under three FT placement strategies: FT not inserted (Baseline), FT inserted before channel mapping (pre-CM), and FT inserted after channel mapping (post-CM). The pre-CM configuration consistently outperforms the others, validating the effectiveness of feature-level adjustments.
  • Figure 4: Detailed workflow of unsupervised training. The LRMS image $\mathbf{Y}$ and PAN image $\mathbf{P}$ are fed into the backbone network to extract high-dimensional latent features $\mathbf{Z}$, which are then refined by the FT module to produce tailored features $\mathbf{Z}^*$. Both $\mathbf{Z}$ and $\mathbf{Z}^*$ are passed through the channel mapping (CM) module to generate the original and tailored HRMS outputs, $\mathbf{\hat{X}}$ and $\mathbf{\hat{X}}^*$, respectively. These outputs are compared with the inputs to compute unsupervised losses to update the FT module.
  • Figure 5: Visual fusion examples and HQNR maps in two cross-sensor cases: QB $\to$ GF2 (upper) and WV3 $\to$ WV2 (lower).
  • ...and 7 more figures
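
The patch-wise scheme in the Figure 2 caption (split, train on a random subset, infer on all patches, stitch) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code: the helper names, the non-overlapping split, the patch sizes, the resolution ratio of 4 between PAN and LRMS, and the number of training patches are assumptions, and the model call is replaced by an identity placeholder.

```python
import numpy as np


def split_into_patches(image, patch):
    """Split a (C, H, W) image into non-overlapping (C, patch, patch) tiles.

    Returns the tiles stacked as (N, C, patch, patch) plus the tile grid shape.
    H and W are assumed to be multiples of `patch`.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    tiles = image.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return tiles.reshape(gh * gw, c, patch, patch), (gh, gw)


def stitch_patches(patches, grid):
    """Inverse of split_into_patches: (N, C, ph, pw) tiles back to (C, H, W)."""
    gh, gw = grid
    _, c, ph, pw = patches.shape
    tiles = patches.reshape(gh, gw, c, ph, pw).transpose(2, 0, 3, 1, 4)
    return tiles.reshape(c, gh * ph, gw * pw)


def select_training_patches(num_patches, k, seed=0):
    """Pick k distinct patch indices at random for test-time training."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_patches, size=min(k, num_patches), replace=False)


# Toy inputs: an 8-band LRMS image and a PAN image at 4x the resolution (assumed ratio).
lrms = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
pan = np.arange(64 * 64, dtype=float).reshape(1, 64, 64)

# Scaling the patch size by the ratio keeps LRMS and PAN tiles spatially aligned.
lrms_patches, lrms_grid = split_into_patches(lrms, patch=4)
pan_patches, pan_grid = split_into_patches(pan, patch=16)

# (1) A random subset of patch indices would be used to train the Feature Tailor.
train_idx = select_training_patches(len(pan_patches), k=4)

# (2) All patches would go through the model as one batch; identity is a placeholder here.
out_patches = pan_patches.copy()
stitched = stitch_patches(out_patches, pan_grid)
```

Because the tiles are stacked along a leading batch axis, inference over all of them maps onto a single batched forward pass, which is one plausible reading of the "parallel inference" in the caption.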