
Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Yongshuo Zhu, Lu Li, Keyan Chen, Chenyang Liu, Fugen Zhou, Zhenwei Shi

TL;DR

This work tackles the challenge of robust, granular remote sensing image change captioning (RSICC) by introducing Semantic-CC, a four-component framework that leverages foundation-model latent knowledge and pixel-level guidance from change detection (CD). The architecture comprises a bi-temporal SAM-based encoder, a multi-task semantic aggregation neck, a simple multi-scale CD decoder, and a Vicuna-based CC decoder with a change semantic feature enhancer, all trained under a three-stage strategy to avoid negative transfer. Empirical results on LEVIR-CD and LEVIR-CC show that CD and CC mutually reinforce each other, yielding superior change masks and high-quality captions compared with state-of-the-art methods. The approach highlights the practical potential of cross-task semantic guidance and foundation-model priors for scalable, high-precision RSICC with limited annotation requirements.

Abstract

Remote sensing image change captioning (RSICC) aims to articulate the changes in objects of interest within bi-temporal remote sensing images using natural language. Given the limitations of current RSICC methods in expressing general features across multi-temporal and spatial scenarios, and their deficiency in providing granular, robust, and precise change descriptions, we introduce a novel change captioning (CC) method based on foundational knowledge and semantic guidance, which we term Semantic-CC. Semantic-CC alleviates the dependency of high-generalization algorithms on extensive annotations by harnessing the latent knowledge of foundation models, and it generates more comprehensive and accurate change descriptions guided by pixel-level semantics from change detection (CD). Specifically, we propose a bi-temporal SAM-based encoder for dual-image feature extraction; a multi-task semantic aggregation neck for facilitating information interaction between heterogeneous tasks; a straightforward multi-scale change detection decoder to provide pixel-level semantic guidance; and a change caption decoder based on a large language model (LLM) to generate change description sentences. Moreover, to ensure the stability of the joint training of CD and CC, we propose a three-stage training strategy that supervises different tasks at various stages. We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets. The experimental results corroborate the complementarity of CD and CC, demonstrating that Semantic-CC can generate more accurate change descriptions and achieve optimal performance across both tasks.
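
The three-stage strategy can be pictured as a schedule that switches the CD and CC loss terms on and off. The sketch below is only illustrative: the abstract states that different tasks are supervised at different stages, but not which task belongs to which stage, so the stage order, the stage names, and the weights (`cd_weight`, `cc_weight`) are all assumptions rather than the authors' configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage with per-task loss weights (hypothetical values)."""
    name: str
    cd_weight: float  # weight on the change detection (mask) loss
    cc_weight: float  # weight on the change captioning (text) loss


# Assumed schedule: the source does not say which task is supervised when.
SCHEDULE = (
    Stage("cd_only", cd_weight=1.0, cc_weight=0.0),
    Stage("cc_only", cd_weight=0.0, cc_weight=1.0),
    Stage("joint", cd_weight=1.0, cc_weight=1.0),
)


def supervised_tasks(stage: Stage) -> list[str]:
    """Return the tasks that receive supervision in the given stage."""
    tasks = []
    if stage.cd_weight > 0:
        tasks.append("CD")
    if stage.cc_weight > 0:
        tasks.append("CC")
    return tasks


def total_loss(stage: Stage, cd_loss: float, cc_loss: float) -> float:
    """Combine the two task losses using the weights of the current stage."""
    return stage.cd_weight * cd_loss + stage.cc_weight * cc_loss


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(stage.name, supervised_tasks(stage), total_loss(stage, 0.5, 2.0))
```

Keeping the schedule as data makes the stage boundaries explicit: a stage whose weight is zero for one task contributes no gradient from that task's loss, which is one simple way to keep the two objectives from interfering early in training.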

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The architecture of the Semantic-CC consists of four main components: a bi-temporal SAM-based encoder, a multi-task semantic aggregation neck, a change detection decoder, and a change caption decoder.
  • Figure 2: The structure of the bi-temporal change semantic filter (BCSF).
  • Figure 3: The overview of the multi-task semantic aggregation neck.
  • Figure 4: The structure of the inter-task attention unit.
  • Figure 5: Change semantic feature enhancer.
  • ...and 4 more figures