Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection

Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, Tat-Seng Chua

TL;DR

This paper introduces a new task named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence, and presents VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers.

Abstract

Remote sensing change detection aims to perceive changes occurring on the Earth's surface from remote sensing data in different periods, and feed these changes back to humans. However, most existing methods only focus on detecting change regions, lacking the capability to interact with users to identify changes that the users expect. In this paper, we introduce a new task named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, which provides a valuable and diverse dataset for remote sensing applications. Furthermore, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers. Our method achieves state-of-the-art results on both the classic change detection-based visual question answering (CDVQA) and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for developing better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored research field. The proposed benchmark dataset and method are available at https://github.com/like413/VisTA.
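As a rough illustration of the triplet structure described above, the following sketch models one question, textual answer, and visual mask as a small record. The class name `QAGSample`, its field names, and the example question and answer are hypothetical and are not taken from the released dataset; only the eight question-type names follow the paper's Figure 2 caption. The mask is assumed to be a binary grid, and `area_ratio` is one plausible way to compute the fraction of pixels covered by the visual answer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class QuestionType(Enum):
    # The eight question types, using the abbreviations from the Figure 2 caption.
    CN = "change or not"
    CTW = "change to what"
    CFW = "change from what"
    IN = "increase or not"
    DN = "decrease or not"
    LC = "largest change"
    SC = "smallest change"
    CR = "change ratio"


@dataclass
class QAGSample:
    """Hypothetical container for one (question, answer, mask) triplet."""

    question: str
    question_type: QuestionType
    answer: str
    mask: List[List[int]]  # assumed binary: 1 marks a pixel in the visual answer

    def area_ratio(self) -> float:
        """Fraction of mask pixels set to 1 (0.0 for an empty mask)."""
        total = sum(len(row) for row in self.mask)
        if total == 0:
            return 0.0
        return sum(sum(row) for row in self.mask) / total


# Made-up example values, for illustration only.
sample = QAGSample(
    question="Have the buildings changed?",
    question_type=QuestionType.CN,
    answer="yes",
    mask=[[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
)
print(sample.area_ratio())  # 4 marked pixels out of 16
```

Keeping the question type as an explicit field makes it straightforward to report per-type statistics, though the actual file layout of QAG-360K may differ.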

Paper Structure

This paper contains 15 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Change detection (CD) identifies surface changes from multi-temporal images. Classic visual question answering (VQA) only supports textual answers. In comparison, the proposed change detection question answering and grounding (CDQAG) supports well-founded answers, i.e., textual answers (“what has changed”) and relevant visual feedback (“where has changed”).
  • Figure 2: The proposed QAG-360K benchmark statistics: (a) the frequency of answer categories; (b) the distribution of question types (CN: change or not; CtW: change to what; CfW: change from what; IN: increase or not; DN: decrease or not; LC: largest change; SC: smallest change; CR: change ratio); (c) the word cloud of all questions; (d) the distribution of area ratio for visual masks.
  • Figure 3: Examples of the proposed QAG-360K dataset.
  • Figure 4: The architecture of our VisTA model as a simple baseline for the CDQAG task. First, the two input remote sensing images and the question are encoded into vision features $F_v$ and language features $[F_s,F_w]$, respectively. $F_v$ and $F_w$ are fed into a vision-language decoder to produce the refined multimodal features $F_{vl}$. Next, the Q&A selector generates the Q&A features $F_s$. $F_s$ is then activated as a selection weight to filter the pixel decoder's output, resulting in the coarse mask $\text{M}_c$. Finally, the textual-visual answer decoder predicts the textual answer and the corresponding visual answer.
  • Figure 5: Example results of our method on QAG-360K dataset.
  • ...and 2 more figures
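
The Figure 4 caption says that $F_s$ is "activated as a selection weight to filter the pixel decoder's output", producing the coarse mask. The sketch below shows one possible reading of that step and is not the authors' implementation: it assumes the activation is a sigmoid, that $F_s$ holds one value per channel, and that filtering means a channel-weighted sum of the pixel decoder's feature maps. The function names `sigmoid` and `coarse_mask` are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]  # one H x W feature map


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def coarse_mask(f_s: List[float], pixel_feats: List[Grid]) -> Grid:
    """Weight each channel of the pixel decoder output by sigmoid(f_s[c]) and sum.

    Assumption: f_s has one entry per channel of pixel_feats.
    """
    if len(f_s) != len(pixel_feats):
        raise ValueError("f_s needs one weight per feature channel")
    weights = [sigmoid(v) for v in f_s]
    height, width = len(pixel_feats[0]), len(pixel_feats[0][0])
    mask = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(weights, pixel_feats):
        for i in range(height):
            for j in range(width):
                mask[i][j] += weight * channel[i][j]
    return mask


# Two channels of a 1 x 2 feature map; zero logits give weights of 0.5 each.
example = coarse_mask([0.0, 0.0], [[[1.0, 2.0]], [[3.0, 4.0]]])
print(example)
```

Under this reading, channels whose selection weight is close to zero contribute little to the coarse mask, which is what lets the question steer which regions are highlighted.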