Visual Question Answering on Multiple Remote Sensing Image Modalities

Hichem Boussaid, Lucrezia Tosato, Flora Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry

TL;DR

This work tackles visual question answering in remote sensing by leveraging multiple image modalities and resolutions. It introduces TAMMI, a large-scale dataset that pairs very high-resolution orthophotos, multispectral Sentinel-2 data, and Sentinel-1 SAR with automatically generated, balanced QA pairs, and proposes MM-RSVQA, a VisualBERT-based fusion model that jointly processes these modalities and natural language questions. The authors show that triple-modal fusion improves VQA performance across diverse question types, and provide ablations demonstrating the added value of MS and SAR context over VHR alone. The dataset and baseline model establish a new multi-modal, multi-resolution RSVQA benchmark with potential applicability to other imaging domains such as medical imaging, and they offer an extensible pipeline for expanding regions and modalities.

Abstract

The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is key for the system to understand that scene correctly and answer complex questions. In many fields, such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery), with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic aperture radar). Thanks to an automated pipeline, this dataset can be easily extended according to experimental needs. We also propose the MM-RSVQA (Multi-modal Multi-resolution Remote Sensing Visual Question Answering) model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text through a trainable fusion process. A preliminary experimental study shows promising results of our methodology on this challenging dataset, with an accuracy of 65.56% on the targeted VQA task. This pioneering work opens the way for the community to a new multi-modal, multi-resolution VQA task that can be applied in other imaging domains (such as medical imaging) where multi-modality can enrich the visual representation of a scene. The dataset and code are available at https://tammi.sylvainlobry.com/.

Paper Structure

This paper contains 24 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Summary of our contributions. We introduce a new task in the computer vision community: multi-modal and multi-resolution Visual Question Answering (VQA) on remote sensing images. We introduce a new dataset, TAMMI, associating question/answer pairs with triplets of multi-spectral images, Very High-Resolution (VHR) orthophotos and Synthetic Aperture Radar (SAR) images. In these examples, the white rectangle in the multi-spectral and SAR images corresponds to the extent of the VHR image. Finally, we propose a new model for this task, referred to as MM-RSVQA.
  • Figure 2: Geographical extent of the TAMMI dataset, covering selected regions (highlighted in red) in Metropolitan France: Paris and inner suburbs (departments 75, 92, 93, 94), an urban region; Haute-Savoie (74), a mountainous region; and Hérault (34), a seaside region. For each of these regions, we show a VHR sample from the dataset.
  • Figure 3: Graphical outline of the proposed MM-RSVQA (Multi-modal Multi-resolution RSVQA) architecture. The inputs of the model (multi-modal imagery and textual question) are represented on the left and the output (predicted answer) is on the bottom right. First, we extract features from each image modality and perform a text embedding. These features are passed through a vision-language model (VLM) to obtain a vector which can be classified among a set of pre-defined answers. The different blocks composing the system are detailed in the method section.
  • Figure 4: Sentinel-1 SLC images before (a) and after (b) the application of the proposed debursting method. The visualization is done using a threshold of 233.
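
The data flow outlined in the Figure 3 caption (per-modality features and text embeddings, joined into one sequence, reduced to a single vector, then classified among pre-defined answers) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the names `fuse_tokens`, `pool` and `classify`, the `ANSWERS` vocabulary and the toy vectors are all hypothetical, and mean pooling is only a stand-in for the VisualBERT-based vision-language model, which is trainable and far richer.

```python
# Hypothetical answer vocabulary; the real set of pre-defined answers is not given here.
ANSWERS = ["yes", "no", "urban", "forest"]


def fuse_tokens(text_embs, vhr_feats, ms_feats, sar_feats):
    """Build one joint sequence of (segment_id, vector) tokens.

    Segment 0 is the question text; segments 1-3 are the VHR,
    multi-spectral and SAR feature tokens (an assumed ordering).
    """
    sequence = []
    for segment_id, tokens in enumerate([text_embs, vhr_feats, ms_feats, sar_feats]):
        for vec in tokens:
            sequence.append((segment_id, vec))
    return sequence


def pool(sequence):
    """Stand-in for the vision-language model: mean-pool all token vectors."""
    dim = len(sequence[0][1])
    return [sum(vec[i] for _, vec in sequence) / len(sequence) for i in range(dim)]


def classify(pooled, weights):
    """Linear scoring over the answer vocabulary, returning the best-scoring answer."""
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    return ANSWERS[scores.index(max(scores))]


# Toy 2-dimensional tokens, one per input, purely for illustration.
sequence = fuse_tokens(
    text_embs=[[1.0, 0.0]],
    vhr_feats=[[0.0, 1.0]],
    ms_feats=[[1.0, 1.0]],
    sar_feats=[[0.0, 0.0]],
)
pooled = pool(sequence)
weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]  # one row per answer
prediction = classify(pooled, weights)
```

In a trained system, the pooling step would be replaced by the transformer's learned fusion and the weights would be learned from question/answer pairs; the sketch only shows how tokens from several modalities and the question can share one sequence before classification.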