Towards a multimodal framework for remote sensing image change retrieval and captioning

Roger Ferrod, Luigi Di Caro, Dino Ienco

TL;DR

A novel foundation model is proposed for bi-temporal RS image pairs, in the context of change detection analysis, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval.

Abstract

Recently, there has been increasing interest in multimodal applications that integrate text with other modalities, such as images, audio and video, to facilitate natural language interactions with multimodal AI systems. While applications involving standard modalities have been extensively explored, there is still a lack of investigation into specific data modalities such as remote sensing (RS) data. Despite the numerous potential applications of RS data, including environmental protection, disaster monitoring and land planning, available solutions are predominantly focused on specific tasks like classification, captioning and retrieval. These solutions often overlook the unique characteristics of RS data, such as its capability to systematically provide information on the same geographical areas over time. This ability enables continuous monitoring of changes in the underlying landscape. To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, in the context of change detection analysis, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and a captioning decoder, our model adds text-image retrieval capabilities, in the context of bi-temporal change detection, while maintaining captioning performance comparable to the state of the art. We release the source code and pretrained weights at: https://github.com/rogerferrod/RSICRC.

Paper Structure (16 sections, 18 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The overall architecture of our model. Once the pair of images is encoded through two Siamese pretrained models, the information is processed by a bi-temporal encoder that merges the two representations. A single embedding can then be retrieved through attentive pooling and contrastively compared with the corresponding textual embedding, or used directly as input for the cross-modal decoder. The decoder is split into two parts: unimodal layers that only encode the textual representation, and multimodal layers that generate the captions.
  • Figure 2: Examples of items taken from the LEVIR-CC dataset, where each image pair (before/after) is accompanied by five human-annotated captions (only one is shown here).
  • Figure 3: In contrastive learning, an anchor (image pair) is compared with the captions inside the batch: the corresponding textual description is considered a positive example and the others negatives. If a false negative is detected (caption similarity higher than $\theta$), it can either be excluded from the loss computation (False Negative Elimination) or treated as a positive (False Negative Attraction).
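
The false-negative handling described in Figure 3 can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function name `contrastive_loss`, the image-to-text-only InfoNCE form, the temperature value, and the way extra positives are averaged under attraction are all assumptions made for illustration. Only the thresholding of caption similarities by $\theta$ and the two strategies (elimination and attraction) come from the caption.

```python
import math


def contrastive_loss(sim_img_txt, sim_txt_txt, theta=0.9, mode="none", temperature=0.07):
    """Image-to-text InfoNCE-style loss with optional false-negative handling.

    sim_img_txt[i][j]: similarity between image pair i and caption j.
    sim_txt_txt[i][j]: similarity between captions i and j.
    mode: "none" (plain loss), "fne" (elimination) or "fna" (attraction).
    All names and defaults here are hypothetical, chosen for illustration.
    """
    if mode not in ("none", "fne", "fna"):
        raise ValueError("mode must be 'none', 'fne' or 'fna'")

    n = len(sim_img_txt)
    total = 0.0
    for i in range(n):
        # A caption j != i is flagged as a false negative for anchor i when it
        # is more similar than theta to the anchor's own caption.
        false_neg = set()
        if mode != "none":
            false_neg = {j for j in range(n) if j != i and sim_txt_txt[i][j] > theta}

        # Attraction: flagged captions become additional positives.
        positives = ({i} | false_neg) if mode == "fna" else {i}
        # Elimination: flagged captions are dropped from the denominator.
        kept = [j for j in range(n) if not (mode == "fne" and j in false_neg)]

        logits = {j: sim_img_txt[i][j] / temperature for j in kept}
        # Numerically stable log-sum-exp over the kept captions.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))

        # Average negative log-likelihood over the positives of this anchor.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
    return total / n


# Toy batch of three items in which captions 0 and 1 are near-duplicates.
image_text = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.1], [0.1, 0.1, 0.9]]
text_text = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
for mode in ("none", "fne", "fna"):
    print(mode, contrastive_loss(image_text, text_text, theta=0.9, mode=mode))
```

In this toy batch, elimination lowers the loss relative to the plain version because the near-duplicate caption no longer competes with the true positive, while attraction changes the objective by asking the anchor to score both captions highly. When no caption similarity exceeds $\theta$, all three modes reduce to the same plain loss.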