CrisiSense-RAG: Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment

Yiming Xiao; Kai Yin; Ali Mostafavi

TL;DR

CrisiSense-RAG addresses the challenge of rapid, spatially resolved disaster impact assessment amid temporally asynchronous data by introducing a split-pipeline multimodal retrieval-augmented generation framework. It separates Text Analyst and Visual Analyst reasoning and applies asynchronous fusion, which prioritizes real-time social reports for flood extent while treating post-event imagery as persistent evidence of damage, all under metric-aligned generation. Zero-shot evaluation on Hurricane Harvey across three foundation-model backends shows competitive flood-extent and damage predictions (e.g., Extent MAE from $10.94\%$ to $28.40\%$, Damage MAE from $16.47\%$ to $21.65\%$), with prompt engineering contributing an improvement of up to $4.75$ percentage points. The work demonstrates that general-purpose pretrained models can deliver practical, auditable resilience intelligence without event-specific fine-tuning, offering a deployable pathway for emergency management under real-world data constraints. It also outlines limitations and directions for future multi-hazard extensions and uncertainty quantification.

Abstract

Timely and spatially resolved disaster impact assessment is essential for effective emergency response. However, automated methods typically struggle with temporal asynchrony: real-time human reports capture peak hazard conditions, whereas high-resolution satellite imagery is frequently acquired after the peak and therefore often reflects flood recession rather than maximum extent. Naive fusion of these misaligned streams can yield dangerous underestimates when post-event imagery overrides documented peak flooding. We present CrisiSense-RAG, a multimodal retrieval-augmented generation framework that reframes impact assessment as evidence synthesis over heterogeneous data sources, without disaster-specific fine-tuning. The system employs hybrid dense-sparse retrieval for text sources and CLIP-based retrieval for aerial imagery. A split-pipeline architecture feeds into asynchronous fusion logic that prioritizes real-time social evidence for peak flood extent while treating imagery as persistent evidence of structural damage. Evaluated on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent MAE of 10.94% to 28.40% and a damage severity MAE of 16.47% to 21.65% in zero-shot settings. Prompt-level alignment proves critical for quantitative validity: metric grounding improves damage estimates by up to 4.75 percentage points. These results demonstrate a practical and deployable approach to rapid resilience intelligence under real-world data constraints.
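
The asynchronous fusion idea and the MAE metric can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the `AnalystEstimate` type, the `fuse` function, and in particular the concrete rule (take the larger extent so that later imagery cannot lower a reported peak, and prefer the visual damage estimate) are hypothetical. Values are percentages per ZIP code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AnalystEstimate:
    """One analyst's output for a single ZIP code, in percent (0-100)."""
    flood_extent: Optional[float]  # None when the analyst has no evidence
    damage: Optional[float]


def fuse(text: AnalystEstimate, visual: AnalystEstimate) -> AnalystEstimate:
    """Hypothetical fusion rule (an assumption, not the paper's exact logic)."""
    # Flood extent: imagery acquired after the peak may show receded water,
    # so it is only allowed to raise the text-based estimate, never lower it.
    if text.flood_extent is None:
        extent = visual.flood_extent
    elif visual.flood_extent is None:
        extent = text.flood_extent
    else:
        extent = max(text.flood_extent, visual.flood_extent)

    # Damage: structural damage persists after the water recedes, so the
    # visual estimate is preferred and text is used only as a fallback.
    damage = visual.damage if visual.damage is not None else text.damage
    return AnalystEstimate(extent, damage)


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """MAE in percentage points between predicted and reference values."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


if __name__ == "__main__":
    text = AnalystEstimate(flood_extent=60.0, damage=10.0)
    visual = AnalystEstimate(flood_extent=20.0, damage=35.0)
    print(fuse(text, visual))
    print(mean_absolute_error([60.0, 20.0], [50.0, 40.0]))
```

In this sketch the text-based peak extent survives even though the later imagery shows less water, while the damage value comes from the imagery; a real system would likely weigh evidence quality rather than apply a fixed rule.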

Paper Structure (35 sections, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Remote Sensing and Automated Damage Assessment
Social Sensing: Opportunities and Uncertainties
Multimodal Data Fusion and Resilience Frameworks
Multimodal Retrieval-Augmented Generation (RAG)
Methodology
Task Definition
Data Sources and Preprocessing
Multimodal Retrieval
Split-Pipeline Architecture
Reasoning and Alignment Strategy
Experiments and Results
Dataset
Ablation Study Design
...and 20 more sections

Figures (4)

Figure 1: Overview of the proposed architecture. Our system ingests multimodal data (aerial imagery, social media, 311 calls, sensors), retrieves relevant context, and employs a split-pipeline approach with separate text and visual analyzers before fusing results.
Figure 2: Example of a retrieved aerial imagery tile (Aug 31, 2017) and its machine-generated caption. The caption successfully identifies key features: "Significant flooding is visible, with brown water inundating areas along a winding waterway… and encroaching on the highway in several places."
Figure 3: Performance comparison across methods and models on the full study area ($N=207$). Each panel shows results for one model (Gemini 2.5 Flash, Llama 3.3 70B, Qwen 2.5 72B) across three configurations: Text-Only, Text+Caption, and Multimodal. Performance patterns vary by model and configuration, with Llama showing the lowest errors overall and Qwen showing stronger performance in text-only settings (see Table \ref{tab:main_results} for detailed statistics).
Figure 4: Spatial comparison of Ground Truth vs. CrisiSense-RAG predictions for Flood Extent (Top) and Damage Severity (Bottom). Only areas with PDE data are shown here. The model successfully captures the broad spatial distribution of flooding and the specific pockets of severe structural damage across the Greater Houston area.
