CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification

Shubham Gupta, Nandini Saini, Suman Kundu, Debasis Das

TL;DR

CrisisKAN classifies crisis events from multimodal social media posts, addressing two problems: semantic misalignment between images and text, and the lack of interpretability. It integrates external Wikipedia knowledge via a wiki extraction pipeline, fuses the two modalities with a cross-modal mechanism guided by self-attention, and provides Grad-CAM-based explanations to improve trust in high-stakes settings. Key contributions include a knowledge infusion approach with a wiki-augmented text representation, a fixed-dimensional cross-modal fusion mechanism, an explainability module, and the MTMS metric for multi-task evaluation. Experiments on CrisisMMD show state-of-the-art performance across tasks and settings, with qualitative visualizations supporting interpretability. The work offers more accurate, explainable, and robust multimodal classification for practical crisis analytics, and names real-time knowledge integration and longer-text handling as directions for future work.

Abstract

The pervasive use of social media has made it an emerging source of real-time information (images, text, or both) for identifying various events. Despite the rapid growth of image- and text-based event classification, state-of-the-art (SOTA) models find it challenging to bridge the semantic gap between features of the image and text modalities because of inconsistent encoding. Also, the black-box nature of these models leaves their outcomes unexplained, which undermines trust in high-stakes situations such as disasters and pandemics. Additionally, the word limit imposed on social media posts can potentially introduce bias towards specific events. To address these issues, we propose CrisisKAN, a novel Knowledge-infused and Explainable Multimodal Attention Network that combines images and texts with external knowledge from Wikipedia to classify crisis events. To enrich the context-specific understanding of textual information, we integrate Wikipedia knowledge using the proposed wiki extraction algorithm. Along with this, a guided cross-attention module is implemented to fill the semantic gap in integrating visual and textual data. To ensure reliability, we employ a model-specific approach called Gradient-weighted Class Activation Mapping (Grad-CAM), which provides a robust explanation of the predictions of the proposed model. Comprehensive experiments on the CrisisMMD dataset yield an in-depth analysis across various crisis-specific tasks and settings. As a result, CrisisKAN outperforms existing SOTA methodologies and provides a novel view in the domain of explainable multimodal event classification.
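
To make the explanation step concrete, here is a minimal sketch of the generic Grad-CAM computation the abstract refers to. It is not the authors' implementation. It assumes the feature maps of a convolutional layer and the gradients of the class score with respect to them are already available as nested lists; the function name `grad_cam` and the toy inputs are hypothetical.

```python
def grad_cam(activations, gradients):
    """Generic Grad-CAM heatmap (illustrative sketch, not the paper's code).

    activations: K feature maps, each an H x W nested list.
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    Returns an H x W heatmap scaled to [0, 1].
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global-average-pool each gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for feature_map, alpha in zip(activations, weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * feature_map[i][j]

    # ReLU keeps only regions that push the class score up.
    cam = [[max(0.0, v) for v in row] for row in cam]

    # Normalise by the peak so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy input: two 2x2 feature maps; the second has negative gradients.
toy_activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In the toy input, the second channel receives a negative weight, so the positions where it is active are suppressed by the ReLU and only the regions supported by the first channel remain highlighted.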

Paper Structure (21 sections, 10 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of a Twitter example in which knowledge-enhanced multimodal event classification addresses challenges in the visual and textual modalities.
  • Figure 2: The overall architecture of CrisisKAN.
  • Figure 3: Comparative study of visual explanations for CrisisKAN (ours) and the baseline model (Abavisani et al., 2020) across various tasks in Setting A.
  • Figure 4: Comparison on different image and text encoders.
  • Figure 5: Comparison on different CrisisKAN Model Settings (MS).