Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning
Chenyang Liu, Keyan Chen, Zipeng Qi, Haotian Zhang, Zhengxia Zou, Zhenwei Shi
TL;DR
This work addresses the weak performance of remote sensing image change captioning (RSICC) in complex scenes by introducing pixel-level change detection (CD) as auxiliary supervision. It presents Pix4Cap, a dual-branch architecture with an auxiliary CD branch trained on pseudo-labels generated by pre-trained CD models, and a Semantic Fusion Augment (SFA) module that fuses pixel-level CD cues into the captioning stream via a Transformer decoder. The model is trained end-to-end with a combined loss, $L_{total}=L_{det}+L_{cap}$, which sums the CD and caption losses. The approach yields state-of-the-art results on the LEVIR-CC dataset, with ablations attributing a 1.22% improvement on the $S^*_m$ metric to CD pseudo-label learning. These results indicate that pixel-level CD supervision can enhance change captioning, especially in challenging scenes, and motivate future exploration of multi-class CD pseudo-labels.
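The combined objective $L_{total}=L_{det}+L_{cap}$ can be illustrated with a small sketch. The summary only states that the two losses are summed; the specific forms used here (pixel-wise binary cross-entropy for $L_{det}$, token-level cross-entropy for $L_{cap}$) and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels.

    probs:  predicted change probabilities in (0, 1)
    labels: CD pseudo-labels, each 0 (unchanged) or 1 (changed)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def caption_cross_entropy(token_probs, target_ids, eps=1e-7):
    """Mean negative log-likelihood of the reference caption tokens.

    token_probs: one probability distribution over the vocabulary per step
    target_ids:  index of the reference token at each step
    """
    total = 0.0
    for dist, target in zip(token_probs, target_ids):
        total += -math.log(max(dist[target], eps))
    return total / len(target_ids)


def total_loss(cd_probs, pseudo_labels, token_probs, target_ids):
    """L_total = L_det + L_cap, with equal weighting as in the summary."""
    l_det = binary_cross_entropy(cd_probs, pseudo_labels)  # assumed form of L_det
    l_cap = caption_cross_entropy(token_probs, target_ids)  # assumed form of L_cap
    return l_det + l_cap
```

Because both terms are added with equal weight, the gradient from the CD branch and the gradient from the captioning branch update the shared parameters in a single backward pass, which is what end-to-end training means here.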
Abstract
Existing methods for Remote Sensing Image Change Captioning (RSICC) perform well in simple scenes but degrade in complex scenes. This limitation is primarily attributed to the model's limited visual ability to distinguish and locate changes. Given the inherent correlation between the change detection (CD) and RSICC tasks, we argue that pixel-level CD is valuable for describing the differences between images in language. However, the current RSICC dataset lacks readily available pixel-level CD labels. To address this gap, we use a model trained on existing CD datasets to derive CD pseudo-labels. We propose a network with an auxiliary CD branch supervised by these pseudo-labels. Furthermore, a semantic fusion augment (SFA) module is proposed to fuse the feature information extracted by the CD branch, thereby facilitating a more nuanced description of changes. Experiments demonstrate that our method achieves state-of-the-art performance and validate that learning pixel-level CD pseudo-labels significantly contributes to change captioning. Our code will be available at: https://github.com/Chen-Yang-Liu/Pix4Cap
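As a rough illustration of how CD pseudo-labels might be derived, the sketch below binarizes a per-pixel change probability map of the kind a pre-trained CD model could output. The abstract does not specify the procedure; the thresholding step, the 0.5 threshold, and the function name `make_pseudo_label` are all hypothetical.

```python
def make_pseudo_label(change_probs, threshold=0.5):
    """Binarize a 2-D change probability map into a CD pseudo-label mask.

    change_probs: list of rows, each a list of probabilities in [0, 1],
                  assumed to come from a CD model trained on another dataset
    threshold:    assumed cut-off; pixels at or above it are marked changed
    """
    return [
        [1 if p >= threshold else 0 for p in row]
        for row in change_probs
    ]


# Toy 2x2 probability map standing in for a real model output.
mask = make_pseudo_label([[0.1, 0.9], [0.5, 0.3]])
```

The resulting mask would then serve as the supervision target for the auxiliary CD branch in place of the ground-truth labels that the RSICC dataset lacks.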
