
Can SAR improve RSVQA performance?

Lucrezia Tosato, Sylvain Lobry, Flora Weissgerber, Laurent Wendling

TL;DR

The paper investigates adding Synthetic Aperture Radar (SAR) data to Remote Sensing Visual Question Answering (RSVQA) and proposes a three-phase workflow: SAR-only land-cover classification, fusion of SAR and optical data, and a VQA module that passes the predicted class names, together with the question, as a prompt to a language model. It shows that a three-channel SAR input (VV, VH, and polarization ratio) raises the macro $F_2$ score for land-cover classification by about 2% over two-channel inputs, and that late fusion generally outperforms early fusion when combining SAR and optical data. The study then turns the classification results into VQA prompts for DistilBERT and finds that SAR+optical inputs can boost VQA accuracy for some questions and classes, particularly water-related categories, though the overall improvement depends on class balance and data distribution. The results suggest that SAR has meaningful potential to enhance RSVQA, but they also highlight the need for balanced datasets, data augmentation, and improved explainability (e.g., SAR captions) to achieve robust, generalizable performance.
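To make the macro $F_2$ metric mentioned above concrete, here is a minimal sketch for multi-label predictions. It assumes per-class $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$ with $\beta = 2$, averaged without weighting over classes. The function names and toy labels are illustrative only, and the paper's exact averaging convention is not stated in this summary.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from per-class counts; 0.0 when the class never occurs or is predicted."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def macro_f_beta(y_true, y_pred, beta=2.0):
    """Unweighted mean of per-class F-beta over multi-label 0/1 indicator rows."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t and p)
        fp = sum(1 for t, p in pairs if not t and p)
        fn = sum(1 for t, p in pairs if t and not p)
        scores.append(f_beta(tp, fp, fn, beta))
    return sum(scores) / n_classes


# Toy data (hypothetical): 4 image patches, 3 land-cover classes.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
score = macro_f_beta(y_true, y_pred)
print(f"macro F2 = {score:.4f}")
```

With $\beta = 2$, a false negative costs four times as much as a false positive, so the metric favors recall. Because the mean is unweighted, rare classes count as much as frequent ones, which is one reason class balance matters for the reported scores.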

Abstract

Remote sensing visual question answering (RSVQA) has been the subject of several studies in recent years, leading to an increase in new methods. RSVQA takes a satellite image, so far only optical, and a question, automatically searches the image for the answer, and returns it in textual form. In our research, we study whether Synthetic Aperture Radar (SAR) images can be beneficial to this field. We divide our study into three phases, which include classification methods and VQA. In the first phase, we explore the classification results of SAR alone and investigate the best method to extract information from SAR data. In the second, we study the combination of SAR and optical data. In the last phase, we investigate how SAR images and a combination of different modalities behave in RSVQA compared to a method using only optical images. We conclude that adding the SAR modality leads to improved performance, although further research on using SAR data to automatically answer questions is needed, as are more balanced datasets.

Paper Structure (15 sections, 4 equations, 4 figures, 5 tables)

This paper contains 15 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Outline of the proposed method. We propose to extract land cover classes from SAR data or from SAR and optical data. These classes are then used as an input, with the question, to a language model that predicts the answer.
  • Figure 2: Distribution of the 19 classes from the BEN-MM dataset.
  • Figure 3: Distribution of the 36 classes present out of the original 61. The inner, middle, and outside circles represent L1, L2 and L3 classes respectively.
  • Figure 4: Comparison of SAR alone, optical alone and their late fusion combination in VQA.
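
The pipeline outlined in Figure 1 and compared in Figure 4 can be sketched roughly as follows: per-class scores from a SAR classifier and an optical classifier are fused late, thresholded into class names, and joined with the question into a text prompt for the language model. This is only an illustration under stated assumptions. The fusion operator (a weighted average), the 0.5 threshold, the prompt template, the class subset, and all function names are hypothetical and are not taken from the paper.

```python
# Hypothetical subset of land-cover class names, for illustration only.
CLASS_NAMES = ["Urban fabric", "Arable land", "Inland waters"]


def late_fusion(p_sar, p_opt, w_sar=0.5):
    """Late fusion as a weighted average of per-class scores (assumed operator)."""
    return [w_sar * s + (1.0 - w_sar) * o for s, o in zip(p_sar, p_opt)]


def predicted_classes(probs, names, threshold=0.5):
    """Keep the names of classes whose fused score reaches the threshold."""
    return [name for name, p in zip(names, probs) if p >= threshold]


def build_prompt(classes, question):
    """Join predicted class names and the question into one text prompt (assumed template)."""
    context = ", ".join(classes) if classes else "none"
    return f"Land cover classes: {context}. Question: {question}"


# Toy per-class scores from the two modality-specific classifiers (made up).
p_sar = [0.2, 0.4, 0.9]
p_opt = [0.7, 0.3, 0.6]

fused = late_fusion(p_sar, p_opt)
prompt = build_prompt(predicted_classes(fused, CLASS_NAMES),
                      "Is there water in the image?")
print(prompt)
```

In this toy case the optical model alone would report "Urban fabric" and "Inland waters", while the fused scores keep only "Inland waters". That shows how the choice of modality changes the text the language model sees, and therefore the answer it can give.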