Frequency-Adaptive Pan-Sharpening with Mixture of Experts
Xuanhua He, Keyu Yan, Rui Li, Chengjun Xie, Jie Zhang, Man Zhou
TL;DR
The paper tackles pan-sharpening by introducing frequency-adaptive processing through a learnable frequency mask and a mixture-of-experts (MOE) framework. The method, FAME-Net, uses a DCT-based frequency mask predictor to partition features into high- and low-frequency streams (HF-MOE and LF-MOE) and an Experts Mixture module to dynamically fuse these streams with PAN/MS information. Key contributions include the first integration of MOE with frequency-domain processing in pan-sharpening, a differentiable Gumbel-Softmax-based mask, and a composite loss with reconstruction, mask, and load terms that balance expert utilization. Experimental results on WV2, GF2, and WV3 demonstrate state-of-the-art performance and strong generalization to real-world scenes, supported by ablations and visualizations of frequency-aware feature maps.
Abstract
Pan-sharpening involves reconstructing missing high-frequency information in multi-spectral images with low spatial resolution, using a higher-resolution panchromatic image as guidance. Although the inborn connection with frequency domain, existing pan-sharpening research has not almost investigated the potential solution upon frequency domain. To this end, we propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening, which consists of three key components: the Adaptive Frequency Separation Prediction Module, the Sub-Frequency Learning Expert Module, and the Expert Mixture Module. In detail, the first leverages the discrete cosine transform to perform frequency separation by predicting the frequency mask. On the basis of generated mask, the second with low-frequency MOE and high-frequency MOE takes account for enabling the effective low-frequency and high-frequency information reconstruction. Followed by, the final fusion module dynamically weights high-frequency and low-frequency MOE knowledge to adapt to remote sensing images with significant content variations. Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs the best against other state-of-the-art ones and comprises a strong generalization ability for real-world scenes. Code will be made publicly at \url{https://github.com/alexhe101/FAME-Net}.
