R2RAG-Flood: A reasoning-reinforced training-free retrieval augmentation generation framework for flood damage nowcasting

Lipai Huang, Kai Yin, Chia-Fu Liu, Ali Mostafavi

TL;DR

R2RAG-Flood addresses post-storm Property Damage Extent (PDE) nowcasting with a training-free retrieval-augmented generation pipeline built on a divergence-informed, reasoning-centric knowledge base of labeled tabular records. The framework converts predictors into text-mode summaries, generates model-based reasoning trajectories, and guides predictions with context from geospatial neighbors and class prototypes, applying a rule-based downgrade to curb over-prediction. In a Hurricane Harvey case study of Harris County, Texas, seven LLM backbones reach 0.613 to 0.668 overall accuracy against 0.714 for the supervised baseline, while providing explicit, structured reasoning at inference; lightweight models are also substantially more cost-efficient. The work introduces instance-level reasoning metrics and a downgrade mechanism, supports practical deployment along a cost–accuracy frontier, and identifies broader validation and domain-expert integration as directions for future work.

Abstract

R2RAG-Flood is a reasoning-reinforced, training-free retrieval-augmented generation framework for post-storm property damage nowcasting. Building on an existing supervised tabular predictor, the framework constructs a reasoning-centric knowledge base composed of labeled tabular records, where each sample includes structured predictors, a compact natural language text-mode summary, and a model-generated reasoning trajectory. During inference, R2RAG-Flood issues context-augmented prompts that retrieve and condition on relevant reasoning trajectories from nearby geospatial neighbors and canonical class prototypes, enabling the large language model backbone to emulate and adapt prior reasoning rather than learn new task-specific parameters. Predictions follow a two-stage procedure that first determines property damage occurrence and then refines severity within a three-level Property Damage Extent categorization, with a conditional downgrade step to correct over-predicted severity. In a case study of Harris County, Texas at the 12-digit Hydrologic Unit Code scale, the supervised tabular baseline trained directly on structured predictors achieves 0.714 overall accuracy and 0.859 damage class accuracy for medium and high damage classes. Across seven large language model backbones, R2RAG-Flood attains 0.613 to 0.668 overall accuracy and 0.757 to 0.896 damage class accuracy, approaching the supervised baseline while additionally producing a structured rationale for each prediction. Using a severity-per-cost efficiency metric derived from API pricing and GPU instance costs, lightweight R2RAG-Flood variants demonstrate substantially higher efficiency than both the supervised tabular baseline and larger language models, while requiring no task-specific training or fine-tuning.
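
The two-stage procedure with a conditional downgrade can be illustrated with a minimal sketch. It is not the authors' implementation. The label names (`none`, `medium`, `high`), the feature names, the thresholds, and the one-step downgrade are all assumptions made for illustration, and the three callables stand in for the LLM calls and rule checks that the framework would supply.

```python
from typing import Callable, Dict

# Assumed label names for the three-level PDE categorization, ordered by severity.
LEVELS = ("none", "medium", "high")

Features = Dict[str, float]


def two_stage_predict(
    features: Features,
    occurs: Callable[[Features], bool],
    severity: Callable[[Features], str],
    violates_rule: Callable[[Features, str], bool],
) -> str:
    """Stage 1 decides occurrence, stage 2 refines severity,
    then a rule check may downgrade the label by one level."""
    # Stage 1: does any property damage occur?
    if not occurs(features):
        return "none"

    # Stage 2: refine severity among the damage classes.
    label = severity(features)
    if label not in ("medium", "high"):
        raise ValueError(f"unexpected severity label: {label!r}")

    # Conditional downgrade (assumed here to be a single step down).
    if violates_rule(features, label):
        label = LEVELS[LEVELS.index(label) - 1]
    return label


# Toy stand-ins for the LLM backbone and the downgrade rule.
# Feature names and thresholds are hypothetical.
def toy_occurs(f: Features) -> bool:
    return f["flood_depth_m"] > 0.1


def toy_severity(f: Features) -> str:
    return "high" if f["flood_depth_m"] > 1.0 else "medium"


def toy_rule(f: Features, label: str) -> bool:
    return label == "high" and f["building_density"] < 0.05


example = {"flood_depth_m": 1.4, "building_density": 0.02}
prediction = two_stage_predict(example, toy_occurs, toy_severity, toy_rule)
print(prediction)
```

Passing the stages as callables keeps the control flow separate from the backbone, so the same skeleton applies whether the decisions come from an LLM prompt or from a fixed rule.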

Paper Structure

This paper contains 36 sections, 20 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overview of the R2RAG-Flood workflow. D denotes the variable reference dictionary, and P denotes the divergence-informed feature profile, including the feature ordering and the feature-level divergence table. Stage I derives D, P, and downgrade rules through preprocessing and feature-distribution divergence analysis. Stage II builds a reasoning-centric knowledge base by generating reasoning trajectories and selecting free-shots (prototypes and hard-boundary cases), using HUC12-level libraries with a global fallback. Stage III performs context-augmented prediction by retrieving neighbor reasoning and free-shots and applying downgrade rules when hard-gate violations or contradictions occur. In Stage II and Stage III, color shading indicates the execution order from light to dark.
  • Figure 2: Study area and Harris County PDE category distribution.
  • Figure 3: Ablation study of macro-F1 for PDE category prediction across seven LLM backbones. I: baseline prediction using only the target text mode. II: I + retrieved neighbor reasoning. III: II + conditional free-shots. IV: III + downgrade mechanism.
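
The ablation in Figure 3 is reported in macro-F1, which averages per-class F1 scores with equal weight so that rarer damage classes count as much as the most common one. The sketch below shows the standard computation in plain Python. The label names and the toy predictions are assumptions for illustration and are not data from the paper. A class with no true or predicted instances is scored 0 here, which is a convention choice.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); scored 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with assumed PDE label names.
labels = ["none", "medium", "high"]
y_true = ["none", "none", "medium", "high"]
y_pred = ["none", "medium", "medium", "high"]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```

Computing this score once per configuration (I through IV) and per backbone would give the kind of comparison the figure summarizes.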