Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong

TL;DR

The paper tackles the difficulty of assessing underground mining disasters under extreme visual degradation by introducing MDSE, a multimodal framework that generates detailed, context-aware textual explanations. It combines segmentation-aware dual-pathway visual encoding with a context-aware cross-attention mechanism and uses Low-Rank Adaptation to efficiently fine-tune a language model. A new Underground Mine Disaster (UMD) dataset supports domain-specific training and evaluation, and MDSE shows superior captioning and retrieval performance on UMD, Incidents1M, and standard benchmarks. The approach enhances situational awareness for underground emergency response while maintaining computational efficiency, making it practical for real-time guidance. Overall, MDSE advances vision–language reasoning in highly degraded, domain-specific environments and demonstrates strong generalization across related disaster-focused tasks.
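To make the Low-Rank Adaptation step concrete, here is a minimal sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank product B·A, so only A and B are updated during fine-tuning. This illustrates the generic technique, not the authors' implementation. The function names, the dimensions (4096 × 4096, rank 8), and the scaling value are assumptions chosen for the example, not values from the paper.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; A (r x d_in) and B (d_out x r)
    are the only trainable matrices. All names here are illustrative.
    """
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r = 8, 8, 2  # toy sizes (assumed, for illustration only)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha=4.0)
base = matvec(W, x)

# Hypothetical layer size to show the parameter saving.
full, lora = param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and for the assumed layer size the adapter trains 256 times fewer parameters than full fine-tuning of that matrix.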

Abstract

Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE (Multimodal Disaster Situation Explainer), a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE makes three contributions: (i) context-aware cross-attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-aware dual-pathway visual encoding that fuses global and region-specific embeddings; and (iii) a resource-efficient transformer-based language model for expressive caption generation at minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, which enables rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and thereby improving situational awareness for underground emergency response. The code is at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
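The combination of dual-pathway encoding and cross-attention can be illustrated with a small sketch, assuming that a global image embedding and per-region embeddings are concatenated into one set of visual tokens that text queries attend over. The code below is plain scaled dot-product cross-attention, not the paper's context-aware variant. All vectors, sizes, and names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wi * v[j] for wi, v in zip(w, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical embeddings: one global vector plus two segmented regions.
global_embedding = [[0.2, 0.1, 0.0, 0.3]]
region_embeddings = [[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]]

# Assumed fusion step: concatenate both pathways into one token sequence.
visual_tokens = global_embedding + region_embeddings

# A single toy text query that aligns with the first region.
text_queries = [[2.0, 0.0, 0.0, 0.0]]

attended, attn = cross_attention(text_queries, visual_tokens, visual_tokens)
print(attn[0])
```

In this toy setup the query places its largest attention weight on the first region token, which is the behavior a region-specific pathway is meant to enable when the global view is degraded.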

Paper Structure

This paper contains 12 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Representative scenes depicting disaster scenarios in underground mining environments.
  • Figure 2: Overview of the proposed MDSE framework and its main components.
  • Figure 3: Representative samples from the UMD dataset with corresponding textual annotations.
  • Figure 4: MDSE inference results on the UMD and DNICC19k datasets, showing its captioning performance under challenging conditions.
  • Figure 5: Demo results of MDSE on the Incidents1M dataset.