Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning

Chenyang Liu, Keyan Chen, Zipeng Qi, Haotian Zhang, Zhengxia Zou, Zhenwei Shi

TL;DR

This work tackles the difficulty of remote sensing image change captioning (RSICC) in complex scenes by introducing pixel-level change detection (CD) as auxiliary supervision. It presents Pix4Cap, a dual-branch architecture with two key components: an auxiliary CD branch trained on CD pseudo-labels generated by pre-trained CD models, and a Semantic Fusion Augment (SFA) module that fuses pixel-level CD cues into the captioning stream via a Transformer decoder. The approach yields state-of-the-art results on the LEVIR-CC RSICC dataset, with ablations attributing a 1.22% improvement on the $S^*_m$ metric to CD pseudo-label learning. This demonstrates that pixel-level CD supervision can significantly enhance change captioning, especially in challenging scenes, and motivates future exploration of multi-class CD pseudo-labels. The model is trained end-to-end with the combined loss $L_{total}=L_{det}+L_{cap}$, the sum of the CD and captioning losses.
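
To make the combined objective $L_{total}=L_{det}+L_{cap}$ concrete, here is a minimal sketch in plain Python. It assumes both terms are mean cross-entropy losses (over pixels for $L_{det}$, over caption tokens for $L_{cap}$); the summary above does not state the exact form of either loss, so treat that choice, along with the function names and toy values, as illustrative assumptions rather than the authors' implementation.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])

def mean_cross_entropy(prob_rows, targets):
    """Average cross-entropy over a sequence of predictions (pixels or tokens)."""
    losses = [cross_entropy(p, t) for p, t in zip(prob_rows, targets)]
    return sum(losses) / len(losses)

def total_loss(cd_probs, cd_pseudo_labels, cap_probs, cap_tokens):
    """Return (L_total, L_det, L_cap) with L_total = L_det + L_cap (equal weights)."""
    # Assumption: both losses are cross-entropy; the source only gives the sum.
    l_det = mean_cross_entropy(cd_probs, cd_pseudo_labels)
    l_cap = mean_cross_entropy(cap_probs, cap_tokens)
    return l_det + l_cap, l_det, l_cap

# Toy data: 4 pixels with 2 classes (unchanged/changed), 3 caption tokens, vocab of 3.
cd_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
cd_pseudo_labels = [0, 1, 0, 1]  # pseudo-labels, not ground truth
cap_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
cap_tokens = [0, 1, 2]

loss, l_det, l_cap = total_loss(cd_probs, cd_pseudo_labels, cap_probs, cap_tokens)
print(f"L_det={l_det:.4f}  L_cap={l_cap:.4f}  L_total={loss:.4f}")
```

Because the two terms are simply added, gradients from the CD branch and the captioning branch both reach any shared encoder, which is what lets the pseudo-label supervision influence the captioning features.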

Abstract

Existing methods for Remote Sensing Image Change Captioning (RSICC) perform well in simple scenes but poorly in complex ones. This limitation is primarily attributed to the model's limited visual ability to distinguish and locate changes. Given the inherent correlation between the change detection (CD) and RSICC tasks, we believe pixel-level CD is significant for describing the differences between images through language. However, the current RSICC dataset lacks readily available pixel-level CD labels. To address this deficiency, we use a model trained on existing CD datasets to derive CD pseudo-labels. We propose a network with an auxiliary CD branch supervised by these pseudo-labels. Furthermore, a semantic fusion augment (SFA) module is proposed to fuse the feature information extracted by the CD branch, thereby facilitating the nuanced description of changes. Experiments demonstrate that our method achieves state-of-the-art performance and validate that learning pixel-level CD pseudo-labels significantly contributes to change captioning. Our code will be available at: https://github.com/Chen-Yang-Liu/Pix4Cap
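
The pseudo-label step described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: `toy_cd_model` is a hypothetical stand-in (a per-pixel absolute difference) for a real pre-trained CD network, and the 0.5 threshold used to binarize its scores is an assumed value, not one taken from the paper.

```python
from typing import Callable, List

Image = List[List[float]]  # single-channel image with values in [0, 1]
Mask = List[List[int]]     # 1 = changed, 0 = unchanged

def toy_cd_model(before: Image, after: Image) -> Image:
    """Hypothetical stand-in for a pre-trained CD network: per-pixel change scores."""
    return [[abs(a - b) for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(before, after)]

def make_pseudo_label(before: Image, after: Image,
                      cd_model: Callable[[Image, Image], Image],
                      threshold: float = 0.5) -> Mask:
    """Binarize the CD model's scores into a pixel-level pseudo-label mask."""
    scores = cd_model(before, after)
    return [[1 if s >= threshold else 0 for s in row] for row in scores]

# Toy bi-temporal pair: only the right-hand column changes between the two dates.
before = [[0.1, 0.1], [0.2, 0.9]]
after = [[0.1, 0.9], [0.2, 0.1]]

pseudo_label = make_pseudo_label(before, after, toy_cd_model)
print(pseudo_label)  # [[0, 1], [0, 1]]
```

In practice the masks would be computed once, offline, for every image pair in the captioning dataset and then used as targets for the auxiliary CD branch.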

Paper Structure

This paper contains 10 sections, 6 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Illustration of our Pix4Cap method. Left: an overview of our model, which comprises two branches: change captioning and auxiliary change detection. The SFA module acts as a pivotal bridge between the two branches. Right: the structure of some important modules.
  • Figure 2: The comparison between one of five reference captions and the caption generated by our model.