Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering

Elman Ghazaei, Erchan Aptoula

TL;DR

This work tackles domain shift in CDVQA by introducing BrightVQA, a large, multi-modal geo-disaster dataset designed for domain generalization, and the Text-Conditioned State Space Model (TCSSM). TCSSM jointly uses bi-temporal imagery and publicly available geo-disaster textual descriptions to generate the input-dependent parameters of a state-space framework, promoting domain-invariant representations for robust VQA. Empirical results on BrightVQA show TCSSM achieving state-of-the-art performance across ten regions, with ablations examining each design choice, including cross-modal conditioning, Hadamard fusion, and block depth. The dataset and model are intended for disaster response in diverse, unseen geographies, and the authors state that code and data will be released publicly upon acceptance.

Abstract

The Earth's surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features. The input-dependent parameters of TCSSM are dynamically predicted from both the bi-temporal images and the geo-disaster-related descriptions, thereby facilitating the alignment between the bi-temporal visual data and the associated text. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
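
To make the idea of text-conditioned, input-dependent parameters more concrete, the following is a minimal NumPy sketch of a Mamba-style selective scan whose step size and input/output projections are predicted from visual tokens fused with a text embedding. It is not the authors' implementation: the function names (`init_params`, `text_conditioned_scan`), the weight shapes, the use of a single text vector, and the exact placement of the Hadamard product are all assumptions made for illustration.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus, used to keep step sizes positive.
    return np.logaddexp(0.0, x)


def init_params(rng, dim, state):
    # Hypothetical parameter set: projections for delta, B, C and a log-magnitude for A.
    w_delta = 0.1 * rng.standard_normal((dim, dim))
    w_b = 0.1 * rng.standard_normal((dim, state))
    w_c = 0.1 * rng.standard_normal((dim, state))
    a_log = np.zeros((dim, state))
    return w_delta, w_b, w_c, a_log


def text_conditioned_scan(visual, text, params):
    """Run a selective state-space scan over visual tokens.

    visual: (L, D) token sequence, assumed to come from the bi-temporal images.
    text:   (D,) embedding of the geo-disaster description (assumed shape).
    """
    w_delta, w_b, w_c, a_log = params
    # Assumption: Hadamard (element-wise) fusion of each visual token with the text.
    fused = visual * text                      # (L, D)
    # Input-dependent parameters are predicted from the fused representation.
    delta = softplus(fused @ w_delta)          # (L, D), positive step sizes
    b = fused @ w_b                            # (L, N)
    c = fused @ w_c                            # (L, N)
    a = -np.exp(a_log)                         # (D, N), negative for a decaying state

    length, dim = visual.shape
    h = np.zeros((dim, a.shape[1]))            # hidden state, one row per channel
    y = np.empty((length, dim))
    for t in range(length):
        a_bar = np.exp(delta[t][:, None] * a)          # discretized state transition
        b_bar = delta[t][:, None] * b[t][None, :]      # discretized input matrix
        h = a_bar * h + b_bar * visual[t][:, None]     # state update
        y[t] = h @ c[t]                                # read-out per channel
    return y


rng = np.random.default_rng(0)
params = init_params(rng, dim=4, state=3)
visual_tokens = rng.standard_normal((5, 4))
text_embedding = rng.standard_normal(4)
output = text_conditioned_scan(visual_tokens, text_embedding, params)
```

In this sketch, changing `text_embedding` changes `delta`, `b`, and `c`, and therefore how the same visual tokens are written into and read from the hidden state. That is the sense in which the selection mechanism is conditioned on text rather than on the visual input alone.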

Paper Structure

This paper contains 29 sections, 14 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Illustration of the proposed TCSSM model.
  • Figure 2: Comparison of selection mechanisms in Mamba vs the proposed TCSSM.
  • Figure 3: Qualitative comparison of VQA model responses across various disaster-related questions.
  • Figure 4: Average accuracy (%) across all geographical regions of various VQA models evaluated with different training data sizes (10%, 20%, and 100%).