Linearly-evolved Transformer for Pan-sharpening
Junming Hou, Zihan Cao, Naishan Zheng, Xuan Li, Xiaoyu Chen, Xinyang Liu, Xiaofeng Cong, Man Zhou, Danfeng Hong
TL;DR
Addressing the high computational burden of transformer-based pan-sharpening, the paper introduces a linearly-evolved transformer (LFormer) that replaces the usual cascaded self-attention with a single transformer and a sequence of 1D convolutions, achieving linear complexity. The two-branch architecture fuses MS and PAN features while integrating Sobel-based high-frequency details, optimized with L1 reconstruction loss and SSIM structure loss. Experiments on WV3 and GF2 pan-sharpening benchmarks and hyperspectral fusion (CAVE) demonstrate competitive or superior performance with substantially fewer parameters and FLOPs. The approach offers a practical, scalable global modeling framework for satellite image fusion and extends to hyperspectral tasks.
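The TL;DR mentions Sobel-based high-frequency details and an objective combining an L1 reconstruction loss with an SSIM structure loss. A minimal NumPy sketch of both pieces follows, assuming a single-channel image, edge padding for the Sobel filter, a simplified global (single-window) SSIM instead of the usual Gaussian-windowed SSIM, and a hypothetical weight `ssim_weight=0.1`. The paper's actual kernel handling, SSIM window and loss weighting are not stated here.

```python
import numpy as np

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a 2D image with 3x3 Sobel kernels (edge-padded).

    Implemented as cross-correlation; flipping the kernel only changes the
    sign of gx/gy, so the magnitude is unaffected.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    p = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = p[i:i + h, j:j + w]  # shifted view aligned with kernel tap (i, j)
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.sqrt(gx ** 2 + gy ** 2)

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Simplified SSIM over the whole image as one window (an assumption)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def fusion_loss(pred: np.ndarray, target: np.ndarray, ssim_weight: float = 0.1) -> float:
    """L1 reconstruction loss plus a weighted (1 - SSIM) structure term.

    ssim_weight is a hypothetical value chosen for illustration.
    """
    l1 = np.abs(pred - target).mean()
    return l1 + ssim_weight * (1.0 - global_ssim(pred, target))
```

On a vertical step edge, `sobel_magnitude` responds only next to the edge and is zero in flat regions, and `fusion_loss(x, x)` is zero for identical images.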
Abstract
The vision transformer family has come to dominate satellite pan-sharpening, owing to the global spatial modeling enabled by its core self-attention mechanism. The standard practice in these promising pan-sharpening methods is to stack transformer variants in a cascaded manner. Despite remarkable progress, this success may come at a large cost in model parameters and FLOPs, which hinders deployment on low-resource satellites. To balance favorable performance against expensive computation, we tailor an efficient linearly-evolved transformer variant and use it to build a lightweight pan-sharpening framework. Specifically, we examine the cascaded transformer modeling popular in cutting-edge methods and develop an alternative first-order linearly-evolved transformer variant that uses a chain of 1D linear convolutions to achieve the same function. In this way, our method retains the benefits of the cascaded modeling rule while achieving favorable performance efficiently. Extensive experiments on multiple satellite datasets suggest that our method achieves performance competitive with other state-of-the-art methods while requiring fewer computational resources. We further verify its consistently favorable performance on the hyperspectral image fusion task. Our main focus is to provide an alternative global modeling framework with an efficient structure. The code will be publicly available.
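The abstract's core idea is to replace a cascade of self-attention layers with a single attention computation followed by a chain of 1D linear convolutions. The exact evolution rule is not given in this section, so the NumPy sketch below is a hypothetical reading: one attention map is computed from Q and K, and each later map is derived from its predecessor by a residual 1D convolution along each row (our interpretation of "first-order"). The function `linearly_evolved_attention`, the residual update and the per-step kernels are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_rows(a: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply the same 1D kernel along the last axis of every row ('same' length)."""
    return np.stack([np.convolve(row, kernel, mode="same") for row in a])

def linearly_evolved_attention(x, wq, wk, wv, kernels):
    """Hypothetical sketch: one attention map, then a chain of 1D-conv updates.

    x: (L, C) tokens; wq/wk/wv: (C, d) projections; kernels: list of short 1D
    kernels, one per evolution step, standing in for extra cascaded layers.
    Returns one (L, d) output per step.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # the only QK^T product computed
    outputs = [attn @ v]
    for kern in kernels:
        attn = attn + conv1d_rows(attn, kern)  # assumed first-order update from the previous map
        outputs.append(attn @ v)
    return outputs
```

Because only the first map needs the QK^T product, each additional step costs a row-wise convolution rather than a new attention computation. Passing zero kernels reproduces the first step's output at every later step, which is a quick sanity check; in a trained model the kernels would be learned.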
