Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images

Lucrezia Tosato, Hichem Boussaid, Flora Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry

TL;DR

This paper tackles visual question answering for remote sensing (RSVQA) by integrating segmentation-guided attention into the pipeline, using a multi-channel segmentation representation to focus attention on relevant objects in remote sensing imagery. It introduces a new dataset built from very high-resolution orthophotos with 16 segmentation classes and automatically generated question/answer pairs, enabling evaluation of segmentation-guided attention in this domain. The proposed approach combines ResNet-50 visual features, DistilBERT question embeddings, and a frozen segmentation module to produce attention that improves overall accuracy (OA) by about 10 percentage points over vanilla attention, reaching an OA of approximately 45.44% and an average accuracy (AA) of approximately 43.24%. The results indicate that segmentation guidance helps locate objects and supports question-conditioned reasoning, though questions about spatial relations remain challenging, which points to richer spatial modeling and dataset expansion as future directions.
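The two reported metrics can diverge when question types are unbalanced, so a small worked example may help. The sketch below assumes the definitions commonly used in RSVQA work: OA is the fraction of all questions answered correctly, and AA is the unweighted mean of the per-question-type accuracies. The summary above does not spell these out, and the function name, record layout, and toy data are hypothetical.

```python
from collections import defaultdict


def overall_and_average_accuracy(records):
    """Compute OA, AA, and per-type accuracy.

    records: iterable of (question_type, predicted_answer, ground_truth).
    Assumed definitions (not stated in the summary):
      OA = correct answers / all answers
      AA = mean of the accuracies computed separately per question type
    """
    correct_by_type = defaultdict(int)
    total_by_type = defaultdict(int)
    for question_type, predicted, truth in records:
        total_by_type[question_type] += 1
        correct_by_type[question_type] += int(predicted == truth)

    total = sum(total_by_type.values())
    if total == 0:
        raise ValueError("no records to evaluate")

    oa = sum(correct_by_type.values()) / total
    per_type = {t: correct_by_type[t] / total_by_type[t] for t in total_by_type}
    aa = sum(per_type.values()) / len(per_type)
    return oa, aa, per_type


# Toy data (hypothetical): four presence questions, two counting questions.
records = [
    ("presence", "yes", "yes"),
    ("presence", "no", "yes"),
    ("presence", "yes", "yes"),
    ("presence", "no", "no"),
    ("count", "3", "5"),
    ("count", "2", "2"),
]
oa, aa, per_type = overall_and_average_accuracy(records)
print(f"OA={oa:.4f} AA={aa:.4f} per_type={per_type}")
```

On this toy data the presence accuracy is 3/4 and the counting accuracy is 1/2, so OA is 4/6 while AA is 0.625. AA weights every question type equally regardless of how many questions it contains.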

Abstract

Visual Question Answering for Remote Sensing (RSVQA) is the task of answering natural language questions about the content of a remote sensing image. Visual feature extraction is therefore an essential step in a VQA pipeline. By incorporating attention mechanisms into this process, models gain the ability to focus selectively on salient regions of the image, prioritizing the most relevant visual information for a given question. In this work, we propose to embed a segmentation-guided attention mechanism into an RSVQA pipeline. We argue that segmentation plays a crucial role in guiding attention by providing a contextual understanding of the visual information and highlighting specific objects or areas of interest. To evaluate this methodology, we provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs. Our study shows promising results for the new methodology, which gains almost 10% in overall accuracy over a classical method on the proposed dataset.
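To make the idea of attention guided by segmentation more concrete, here is a minimal sketch of one way it could work. It is not the authors' architecture: it assumes the segmentation module outputs per-location class probabilities and that the question is mapped to one relevance score per class, and it uses a weighted sum of these as attention logits. All names (`segmentation_guided_pool`, `class_relevance`) and the toy values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def segmentation_guided_pool(features, seg_probs, class_relevance):
    """Pool visual features with attention derived from a segmentation map.

    features:        N location feature vectors, each of length D
    seg_probs:       N per-location class probability vectors, each of length C
    class_relevance: C question-conditioned scores (assumed to be given)
    Returns (attention over the N locations, pooled feature of length D).
    """
    # Attention logit per location: how strongly the classes the question
    # cares about are present there (an assumed scoring rule).
    logits = [
        sum(w * p for w, p in zip(class_relevance, probs))
        for probs in seg_probs
    ]
    attention = softmax(logits)
    dim = len(features[0])
    pooled = [
        sum(a * feat[d] for a, feat in zip(attention, features))
        for d in range(dim)
    ]
    return attention, pooled


# Toy example: 4 image locations, 2 classes (building, water), D = 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
seg_probs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.0, 1.0]]
class_relevance = [4.0, 0.0]  # hypothetical scores for a question about buildings

attention, pooled = segmentation_guided_pool(features, seg_probs, class_relevance)
print(attention, pooled)
```

In this toy setup, the locations that the segmentation marks as likely buildings receive most of the attention mass, so the pooled feature is dominated by their descriptors. In a trained model, the relevance scores and the fusion with question features would be learned rather than fixed by hand.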

Paper Structure (14 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Distribution of answers by question type. Labels for numerical answers are omitted, and those answers are shown in sorted order. The maximum numerical values are 280 (counting questions), 40,000 m² (area questions), and 273 m (distance questions).
  • Figure 2: Graphical outline of the proposed architecture. The inputs (very high resolution remote sensing images and language questions) are shown in blue frames, the outputs in red frames (answer and segmentation) and the ground truths (answer and segmentation) in green frames.
  • Figure 3: Example image from the Hauts-de-Seine department (92), with questions, ground truths, and predictions.