Linearly-evolved Transformer for Pan-sharpening

Junming Hou; Zihan Cao; Naishan Zheng; Xuan Li; Xiaoyu Chen; Xinyang Liu; Xiaofeng Cong; Man Zhou; Danfeng Hong

TL;DR

Addressing the high computational burden of transformer-based pan-sharpening, the paper introduces a linearly-evolved transformer (LFormer) that replaces the usual cascaded self-attention with a single transformer and a sequence of 1D convolutions, achieving linear complexity. The two-branch architecture fuses MS and PAN features while integrating Sobel-based high-frequency details, optimized with L1 reconstruction loss and SSIM structure loss. Experiments on WV3 and GF2 pan-sharpening benchmarks and hyperspectral fusion (CAVE) demonstrate competitive or superior performance with substantially fewer parameters and FLOPs. The approach offers a practical, scalable global modeling framework for satellite image fusion and extends to hyperspectral tasks.
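The two ingredients named above, Sobel-based high-frequency details and an L1 plus SSIM objective, can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than details from the paper: the weighting factor `lam`, the use of a single global SSIM window instead of a sliding window, and the helper names `filter2d_valid`, `sobel_magnitude`, `global_ssim`, and `fusion_loss` are all hypothetical.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter2d_valid(img, kernel):
    """Cross-correlate a 2-D image with a small kernel (no padding)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * img[i:i + h - kh + 1, j:j + w - kw + 1]
    return out


def sobel_magnitude(img):
    """Gradient magnitude, used here as a stand-in for high-frequency detail."""
    gx = filter2d_valid(img, SOBEL_X)
    gy = filter2d_valid(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def global_ssim(a, b, data_range=1.0):
    """SSIM computed from global image statistics (assumption: one window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def fusion_loss(pred, target, lam=0.1):
    """L1 reconstruction term plus a weighted (1 - SSIM) structure term.

    `lam` is a hypothetical weight, not a value taken from the paper.
    """
    l1 = np.abs(pred - target).mean()
    return l1 + lam * (1.0 - global_ssim(pred, target))
```

In practice SSIM is usually evaluated over local windows and averaged; the global form is used here only to keep the sketch short.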

Abstract

The vision transformer family has dominated the satellite pan-sharpening field, driven by the global spatial information modeling of its core self-attention mechanism. The standard design in these methods is to stack transformer variants in a cascaded manner. Despite the remarkable advances, their success may come at a large cost in model parameters and FLOPs, which prevents their application on low-resource satellites. To address this trade-off between favorable performance and expensive computation, we tailor an efficient linearly-evolved transformer variant and use it to construct a lightweight pan-sharpening framework. In detail, we examine the cascaded transformer modeling used by cutting-edge methods and develop an alternative 1-order linearly-evolved transformer variant that uses a chain of 1-dimensional linear convolutions to achieve the same function. In this way, our method retains the benefits of the cascaded modeling rule while achieving favorable performance efficiently. Extensive experiments on multiple satellite datasets suggest that our method achieves competitive performance against other state-of-the-art methods with fewer computational resources. Furthermore, consistently favorable performance has been verified on the hyperspectral image fusion task. Our main focus is to provide an alternative global modeling framework with an efficient structure. The code will be publicly available.
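The core idea, computing one attention map and then evolving it with a chain of 1-D convolutions instead of recomputing self-attention in every block, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the convolution is applied along each row of the attention map, each stage adds a residual, and the names `self_attention_map`, `evolve_attention`, and `lformer_sketch` are hypothetical. The sketch does not attempt to reproduce the paper's complexity characteristics.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_map(x, w_q, w_k):
    """Standard scaled dot-product attention weights, computed only once."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def evolve_attention(attn, kernel):
    """Evolve an attention map with a 1-D convolution followed by softmax.

    Assumption: the convolution runs along each row of the map.
    """
    evolved = np.stack([np.convolve(row, kernel, mode="same") for row in attn])
    return softmax(evolved)


def lformer_sketch(x, w_q, w_k, w_vs, kernels):
    """Apply len(w_vs) attention stages using a single computed attention map.

    Stage 0 uses the computed map; each later stage reuses an evolved map.
    """
    attn = self_attention_map(x, w_q, w_k)
    out = x
    for i, w_v in enumerate(w_vs):
        if i > 0:
            attn = evolve_attention(attn, kernels[i - 1])
        out = out + attn @ (out @ w_v)  # residual connection (assumption)
    return out, attn
```

A cascaded baseline would instead call `self_attention_map` once per stage with separate query and key projections; the sketch replaces those repeated calls with `evolve_attention`.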

Paper Structure (17 sections, 16 equations, 8 figures, 6 tables)

Introduction
Related Works
Pan-sharpening
Transformer Based Deep Learning Methods
Proposed Method
Overall Framework
The Underlying Principle of Linearly-evolved Transformer
Architecture Details
Loss Function
Experiments
Datasets and Experimental Settings
Comparison with SOTAs
Extension to Hyperspectral Task
Visualization of Feature Maps
Ablation Study
...and 2 more sections

Figures (8)

Figure 1: Comparison of PSNR and computational overhead between our model and other cutting-edge techniques. The Parameters axis uses a base-2 logarithmic scale for clarity. Our method shows a promising performance-efficiency balance compared to other approaches.
Figure 2: Comparison between prior cascaded self-attention designs within transformers and our proposed linearly-evolved mechanism. Our linearly-evolved design inherits the merits of cascaded modeling while greatly reducing computation cost.
Figure 3: Attention similarity. Illustration of attention maps across different layers from a cascaded vision transformer (ViT) architecture [meng2022vision] on the WorldView-3 testing dataset. $A_i$ ($i=1,\cdots,5$) denotes the attention map from the $i$-th ViT block. The cosine similarity analysis reveals high similarity among attention maps from different ViT blocks, resulting in feature representation redundancy and unnecessary computations. This motivates us to explore a more efficient alternative for modeling feature dependencies that improves pan-sharpening performance while reducing computational overhead.
Figure 4: Overall architecture of the proposed lightweight pan-sharpening framework. LFormer is the core design of our model, where self-attention is replaced by a novel linearly-evolved attention. Sobel and RCB denote the Sobel operator and the residual convolution block, respectively. $\mathcal{F}_w$ represents the linear weight function used for evolving the attention weights. For simplicity, we implement it as a 1-D convolution operator followed by the Softmax function.
Figure 5: Comparison of the error maps between our model and other cutting-edge methods on the WV3 dataset.
...and 3 more figures
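
The cosine similarity analysis described in the Figure 3 caption can be reproduced in spirit with a short NumPy sketch. It assumes each attention map is flattened to a vector before comparison, which the caption does not specify; the function name `attention_cosine_similarity` is hypothetical.

```python
import numpy as np


def attention_cosine_similarity(attn_maps):
    """Pairwise cosine similarity between attention maps of equal shape.

    Each map is flattened to a vector (assumption); the result is a
    symmetric matrix whose entry (i, j) compares map i with map j.
    """
    flat = np.stack([np.asarray(a, dtype=float).ravel() for a in attn_maps])
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T
```

Off-diagonal values close to 1 would indicate the kind of redundancy across blocks that the caption describes.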
