BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla

Ariful Islam, Md Rifat Hossen, Md. Mahmudul Arif, Abdullah Al Noman, Md Arifur Rahman

TL;DR

BanglaMM-Disaster tackles real-time disaster classification for Bangla social media by fusing textual descriptions with corresponding images. It proposes an end-to-end multimodal pipeline that couples transformer-based text encoders with CNN visual features via early fusion, evaluated on a new 5,037-sample, 9-class Bangla disaster dataset. The best configuration achieves 83.76% accuracy, surpassing text-only and image-only baselines by 3.84% and 16.91%, respectively, and exhibits cross-modal benefits across disaster categories. The work demonstrates practical potential for rapid disaster monitoring in low-resource settings and suggests future enhancement with attention-based fusion and graph-based cross-modal reasoning.

Abstract

Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.
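The early-fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "early fusion" here means concatenating the text-encoder and CNN feature vectors before a single shared classification layer; the abstract does not spell out the fusion head, so the actual architecture may differ. The function names (`early_fusion`, `linear`, `softmax`, `classify`), the toy feature sizes, and the random weights are all hypothetical stand-ins, not the authors' code. Only the nine-class output is taken from the paper.

```python
import math
import random

NUM_CLASSES = 9  # from the paper: nine disaster-related categories


def early_fusion(text_vec, image_vec):
    """Concatenate the two modality vectors into one joint feature vector."""
    return list(text_vec) + list(image_vec)


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_vec, image_vec, weights, bias):
    """Fuse both modalities, then apply one shared classification layer."""
    fused = early_fusion(text_vec, image_vec)
    return softmax(linear(fused, weights, bias))


# Toy stand-ins (assumptions): in a real pipeline these vectors would come
# from a transformer text encoder and a CNN backbone, and would be much larger.
rng = random.Random(0)
TEXT_DIM, IMAGE_DIM = 8, 16
text_vec = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]
image_vec = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]

# Untrained random weights, for shape illustration only.
weights = [[rng.gauss(0, 0.1) for _ in range(TEXT_DIM + IMAGE_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(text_vec, image_vec, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the classifier sees both feature sets at once, it can learn weights that combine evidence across modalities, which is one plausible reading of the cross-modal gains reported over the single-modality baselines.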

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Multimodal disaster content from social media.
  • Figure 2: Overview of the proposed multimodal disaster classification framework.
  • Figure 3: Confusion matrix for best visual model (ResNet50).
  • Figure 4: Confusion matrix for best text model (XLM-RoBERTa).
  • Figure 5: Confusion matrix for best multimodal model (mBERT+ResNet50).
  • ...and 1 more figure