SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
TL;DR
SeG-SR introduces semantic guidance into remote sensing image super-resolution by leveraging Vision-Language Models. It implements three components—Semantic Feature Extraction Module, Semantic Localization Module, and Learnable Modulation Module—to extract and inject high-level semantic information into SR units, improving fidelity and reducing semantically inconsistent artifacts. Across UCMerced, AID, and SIRI-WHU, SeG-SR achieves state-of-the-art PSNR/SSIM and enhances perceptual quality, with demonstrated generalizability when inserted into diverse SR architectures. The results highlight the value of incorporating semantic understanding in RSISR, albeit with added computational overhead from the VLM integration.
Abstract
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space while neglecting high-level understanding of remote sensing scenes, which may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work explores the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super-resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance signals from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses this semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on three datasets and consistently improves performance when integrated into various SR architectures. Notably, for the ×4 SR task on the UCMerced dataset, it attains a PSNR of 29.3042 dB and an SSIM of 0.7961.
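For readers less familiar with the PSNR figure quoted in the abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). It is not taken from the SeG-SR code: the function names `psnr` and `mse_for_psnr` are hypothetical, the peak value of 255 assumes 8-bit pixel intensities, and the abstract does not state the exact evaluation protocol (for example, which color channels are compared).

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """Return PSNR in dB between two equally sized flat sequences of pixel values.

    Assumption: `peak` is the maximum possible pixel value (255 for 8-bit data).
    """
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_for_psnr(psnr_db, peak=255.0):
    """Invert the PSNR definition: the MSE that corresponds to a given PSNR in dB."""
    return peak ** 2 / (10.0 ** (psnr_db / 10.0))


if __name__ == "__main__":
    # Toy example: every pixel is off by exactly 1, so the MSE is 1.
    print(psnr([10, 20, 30], [11, 21, 31]))
    # MSE implied by the reported 29.3042 dB, under the 8-bit assumption above.
    print(mse_for_psnr(29.3042))
```

Because PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly a 21% reduction in MSE, which is why differences of a few tenths of a dB are treated as meaningful in SR comparisons.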
