SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi

TL;DR

SeG-SR introduces semantic guidance into remote sensing image super-resolution (RSISR) by leveraging Vision-Language Models (VLMs). It comprises three components: a Semantic Feature Extraction Module (SFEM), a Semantic Localization Module (SLM), and a Learnable Modulation Module (LMM), which together extract high-level semantic information and inject it into the SR units, improving fidelity and reducing semantically inconsistent artifacts. Across UCMerced, AID, and SIRI-WHU, SeG-SR achieves state-of-the-art PSNR/SSIM and enhances perceptual quality, and it generalizes when inserted into diverse SR architectures. The results highlight the value of incorporating semantic understanding in RSISR, albeit with added computational overhead from the VLM integration.

Abstract

High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work explores the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super-resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance maps from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses the semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on three datasets and consistently improves performance across various SR architectures. Notably, for the $\times 4$ SR task on the UCMerced dataset, it attains a PSNR of 29.3042 dB and an SSIM of 0.7961.
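For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the standard definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The function name `psnr` and the default peak value of 255 are assumptions for illustration. The evaluation protocol behind the reported numbers (color space, border cropping) is not stated in this section, so the sketch is not a reproduction of it.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    max_value is the largest possible pixel value (255 for 8-bit images);
    this default is an assumption, not taken from the paper.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if reference.shape != estimate.shape:
        raise ValueError("images must have the same shape")

    # Mean squared error over every pixel and channel.
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, so small dB differences between SR methods correspond to modest changes in pixel error.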

Paper Structure

This paper contains 26 sections, 15 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The previous vanilla SR framework is illustrated in (a), while our proposed SR framework is shown in (b). Our framework introduces semantic guidance information for each SR unit, thereby guiding the SR process.
  • Figure 2: An overview of the proposed SeG-SR. The LR image is first processed by the Semantic Feature Extraction Module (SFEM) to obtain both global and local semantic features. The global features are fed into the Semantic Localization Module (SLM) to generate per-unit localization embeddings. These embeddings are then matched with the local features to produce semantic guidance maps, which are subsequently used to guide the super-resolution process through Learnable Modulation Module (LMM).
  • Figure 3: The structure of the proposed SLM. The MetaNet is used to generate the global feature vector, the self-attention layer interacts with and integrates global and local feature vectors, and the Gated Fusion module produces the final semantic localization embeddings.
  • Figure 4: The structure of the proposed LMM. LMM incorporates semantic information into the SR process by modulating the output features of each SR unit using the corresponding semantic guidance map.
  • Figure 5: Super-resolution ($\times 4$) results of various SR methods on the UCMerced dataset. The first two rows display reconstruction outputs from image "agricultural 85", while the latter two rows show results from image "tennis court 98".
  • ...and 2 more figures
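
The captions of Figures 2 and 4 describe two steps: matching a per-unit localization embedding against the local semantic features to obtain a guidance map, and using that map to modulate the output features of an SR unit. The sketch below illustrates one plausible reading in NumPy. Everything specific in it is an assumption: the captions do not state the matching or modulation operators, so cosine similarity, the rescaling to [0, 1], the scale-and-shift form, and the names `guidance_map`, `modulate`, `gamma`, and `beta` are all hypothetical. It also assumes the guidance map and the SR features share the same spatial size.

```python
import numpy as np


def guidance_map(local_feats, loc_embedding, eps=1e-8):
    """Match one localization embedding against local semantic features.

    local_feats:   (H, W, D) local features (e.g. from a VLM image encoder)
    loc_embedding: (D,) localization embedding for one SR unit
    Returns an (H, W) map in [0, 1]. Cosine similarity is an assumed
    matching operator, not one confirmed by the source.
    """
    feats = local_feats / (np.linalg.norm(local_feats, axis=-1, keepdims=True) + eps)
    emb = loc_embedding / (np.linalg.norm(loc_embedding) + eps)
    similarity = feats @ emb          # (H, W), values in [-1, 1]
    return 0.5 * (similarity + 1.0)   # rescale to [0, 1]


def modulate(unit_feats, gmap, gamma, beta):
    """Scale-and-shift the output of one SR unit with its guidance map.

    unit_feats:  (H, W, C) output features of the SR unit
    gmap:        (H, W) semantic guidance map
    gamma, beta: (C,) per-channel parameters (hypothetical; learned in practice)
    With gamma = beta = 0 the features pass through unchanged.
    """
    g = gmap[..., None]               # broadcast the map over channels
    return unit_feats * (1.0 + gamma * g) + beta * g


# Usage with random stand-in data; real features would come from the network.
rng = np.random.default_rng(0)
local = rng.normal(size=(8, 8, 16))
embedding = rng.normal(size=16)
sr_feats = rng.normal(size=(8, 8, 32))

gmap = guidance_map(local, embedding)
out = modulate(sr_feats, gmap, gamma=np.full(32, 0.1), beta=np.zeros(32))
```

Writing the modulation as a residual around the identity means that zero-initialized parameters leave the base SR network untouched, which is one common way to attach a guidance branch to an existing architecture. Whether SeG-SR does this is not stated in the captions.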