CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening
Lihua Jian, Jiabo Liu, Shaowu Wu, Lihui Chen
TL;DR
Supervised pansharpening depends on simulated ground truth (GT), which creates a domain gap between reduced-resolution training data and real full-resolution scenes. This work tackles that gap by introducing CLIPPan, a two-stage framework that repurposes CLIP as a language-guided supervisor for unsupervised, full-resolution pansharpening. Stage I applies lightweight fine-tuning so that CLIP recognizes the MS, PAN, and HRMS modalities and understands the fusion process, via InterMCL, IntraMCL, and fusion-aware alignment. Stage II combines semantic supervision from protocol-aligned text prompts (e.g., descriptions of Wald's protocol) with low-level reconstruction losses to train pansharpening networks without ground truth, achieving state-of-the-art performance on real datasets. The approach reduces reliance on GT, enables protocol-informed supervision, and opens avenues for discovering novel pansharpening protocols through language-backed guidance.
Abstract
Despite remarkable advancements in supervised pansharpening neural networks, these methods suffer from a resolution-related domain gap caused by the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose an unsupervised pansharpening framework, CLIPPan, that enables model training directly at full resolution by taking CLIP, a vision-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging due to its inherent bias toward natural images and its limited understanding of pansharpening tasks. Therefore, we first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss integrating semantic language constraints, which aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions), thus enabling CLIPPan to use language as a powerful supervisory signal and guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
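
One way to picture a loss that "aligns image-level fusion transitions with protocol-aligned textual prompts" is a directional loss in the embedding space: the change in image embedding from the input to the fused output should point the same way as the change in text embedding from a source prompt to a target prompt. The sketch below is only an illustration under that assumption and is not taken from the paper. The function names, the 512-dimensional embeddings, and the random vectors standing in for the outputs of an adapted CLIP encoder are all hypothetical.

```python
import numpy as np

def unit(v, eps=1e-8):
    """L2-normalize vectors along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def directional_loss(img_src, img_fused, txt_src, txt_tgt):
    """Mean of (1 - cosine similarity) between the image-space transition
    (source image -> fused image) and the text-space transition
    (source prompt -> target prompt). Hypothetical form, for illustration."""
    d_img = unit(img_fused - img_src)   # shape (batch, dim)
    d_txt = unit(txt_tgt - txt_src)     # shape (1, dim), broadcast over batch
    return float(np.mean(1.0 - np.sum(d_img * d_txt, axis=-1)))

# Random stand-ins for embeddings (assumption: a real pipeline would
# obtain these from the adapted CLIP image and text encoders).
rng = np.random.default_rng(0)
dim = 512
txt_ms = unit(rng.normal(size=(1, dim)))     # e.g. prompt describing an MS image
txt_hrms = unit(rng.normal(size=(1, dim)))   # e.g. prompt describing an HRMS image
img_ms = unit(rng.normal(size=(4, dim)))     # batch of input-image embeddings

# A fused embedding that moves exactly along the text direction,
# and one that moves in the opposite direction.
img_good = img_ms + (txt_hrms - txt_ms)
img_bad = img_ms - (txt_hrms - txt_ms)

loss_good = directional_loss(img_ms, img_good, txt_ms, txt_hrms)
loss_bad = directional_loss(img_ms, img_bad, txt_ms, txt_hrms)
print(loss_good, loss_bad)
```

In this toy setup the loss is near zero when the image transition follows the text transition and near two when it opposes it. In an unsupervised training loop, a term of this kind would be combined with low-level reconstruction losses, as the abstract describes, since a semantic direction alone does not constrain pixel-level content.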
