Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images
Lucrezia Tosato, Hichem Boussaid, Flora Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry
TL;DR
This paper tackles remote sensing visual question answering (RSVQA) by integrating segmentation-guided attention into an RSVQA pipeline, using a multi-channel segmentation representation to focus attention on relevant objects in remote sensing imagery. It introduces a new dataset built from very high-resolution orthophotos with 16 segmentation classes and automatically generated question/answer pairs, enabling evaluation of segmentation-guided attention in this domain. The proposed approach combines ResNet-50 visual features, DistilBERT question embeddings, and a frozen segmentation module to produce attention, improving overall accuracy (OA) by about 10 percentage points over vanilla attention and reaching an OA of about 45.44% and an average accuracy (AA) of about 43.24%. The results indicate that segmentation guidance helps locate objects and supports question-conditioned reasoning, though questions about spatial relations remain challenging, pointing to richer spatial modeling and dataset expansion as future directions.
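To make the two reported metrics concrete, the following is a minimal sketch of how OA and AA are commonly computed in RSVQA evaluations. It assumes that AA is the unweighted mean of per-question-type accuracies, which this summary does not state explicitly; the function name, the question-type labels, and the toy data are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def overall_and_average_accuracy(records):
    """Compute (OA, AA) from an iterable of (question_type, is_correct) pairs.

    OA: fraction of all questions answered correctly.
    AA: unweighted mean of the per-question-type accuracies (assumed definition).
    """
    per_type = defaultdict(lambda: [0, 0])  # question_type -> [correct, total]
    for question_type, is_correct in records:
        per_type[question_type][0] += int(is_correct)
        per_type[question_type][1] += 1
    if not per_type:
        raise ValueError("records must contain at least one answer")

    total = sum(t for _, t in per_type.values())
    correct = sum(c for c, _ in per_type.values())
    oa = correct / total
    aa = sum(c / t for c, t in per_type.values()) / len(per_type)
    return oa, aa


# Hypothetical toy data: a frequent, easy question type and a rare, hard one.
records = (
    [("presence", True)] * 8
    + [("presence", False)] * 2
    + [("relation", True)] * 1
    + [("relation", False)] * 3
)
oa, aa = overall_and_average_accuracy(records)
```

Under this definition, AA falls below OA when rarer question types are answered less accurately than frequent ones, which is why the two figures are usually reported together.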
Abstract
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image. Visual feature extraction is therefore an essential step in a VQA pipeline. By incorporating attention mechanisms into this process, models gain the ability to focus selectively on salient regions of the image, prioritizing the most relevant visual information for a given question. In this work, we propose to embed an attention mechanism guided by segmentation into an RSVQA pipeline. We argue that segmentation plays a crucial role in guiding attention by providing a contextual understanding of the visual information and highlighting specific objects or areas of interest. To evaluate this methodology, we provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs. Our study shows promising results for the new methodology, which gains almost 10 percentage points of overall accuracy over a classical method on the proposed dataset.
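The idea of letting segmentation guide attention can be illustrated with a small NumPy sketch. This is not the authors' architecture, whose fusion details are not given here. It assumes that a question has already been mapped to one relevance score per segmentation class (for example by a learned projection of a text embedding) and that the segmentation masks have been resized to the resolution of the visual feature map. The function and variable names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()


def segmentation_guided_attention(visual, seg_masks, class_scores, eps=1e-8):
    """Pool visual features with a spatial attention map built from segmentation.

    visual:       (C, H, W) feature map from an image encoder.
    seg_masks:    (K, H, W) per-class segmentation probabilities.
    class_scores: (K,) question-dependent relevance score per class (assumed given).
    Returns the attended feature vector (C,) and the attention map (H, W).
    """
    class_weights = softmax(class_scores)                       # (K,)
    relevance = np.tensordot(class_weights, seg_masks, axes=1)  # (H, W)
    attention = (relevance + eps) / (relevance + eps).sum()     # sums to 1
    attended = (visual * attention).sum(axis=(1, 2))            # (C,)
    return attended, attention


# Toy example: 16 classes, an 8-channel feature map on a 4x4 grid.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 4, 4))
seg_masks = np.zeros((16, 4, 4))
seg_masks[3, 0, 0] = 1.0   # class 3 occupies the top-left cell
seg_masks[5, 3, 3] = 1.0   # class 5 occupies the bottom-right cell
class_scores = np.zeros(16)
class_scores[3] = 10.0     # the (hypothetical) question is about class 3

attended, attention = segmentation_guided_attention(visual, seg_masks, class_scores)
```

In this toy setting almost all attention mass lands on the cell covered by the class the question refers to, so the pooled vector is close to the visual feature at that location. A learned model would replace the fixed scores with ones predicted from the question.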
