MEGL: Multimodal Explanation-Guided Learning

Yifei Zhang, Tianxu Jiang, Bo Pan, Jingyu Wang, Guangji Bai, Liang Zhao

TL;DR

This work proposes Multimodal Explanation-Guided Learning (MEGL), a framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance, and validates it on two new datasets, Object-ME and Action-ME.

Abstract

Explaining the decision-making processes of Artificial Intelligence (AI) models is crucial for addressing their "black box" nature, particularly in tasks like image classification. Traditional eXplainable AI (XAI) methods typically rely on unimodal explanations, either visual or textual, each with inherent limitations. Visual explanations highlight key regions but often lack rationale, while textual explanations provide context without spatial grounding. Further, both explanation types can be inconsistent or incomplete, limiting their reliability. To address these challenges, we propose a novel Multimodal Explanation-Guided Learning (MEGL) framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance. Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations. Additionally, we introduce Textual Supervision on Visual Explanations to align visual explanations with textual rationales, even in cases where ground truth visual annotations are missing. A Visual Explanation Distribution Consistency loss further reinforces visual coherence by aligning the generated visual explanations with dataset-level patterns, enabling the model to effectively learn from incomplete multimodal supervision. We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations. Experimental results demonstrate that MEGL outperforms previous approaches in prediction accuracy and explanation quality across both visual and textual domains. Our code will be made available upon the acceptance of the paper.

Paper Structure

This paper contains 20 sections, 13 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Comparison of visual and textual explanations for an image classification task: The visual explanation highlights key regions of interest in the image but lacks a semantic rationale, while the textual explanation provides reasoning behind the decision but lacks spatial context for the key regions.
  • Figure 2: Overview of the MEGL Framework. The framework is jointly trained to optimize prediction accuracy, visual explainability, and textual explainability. (a) illustrates the prediction process, where the input image is processed by the classifier (comprising a feature extractor and a linear layer) to predict the label and extract the image’s visual features. In (c), a saliency map is generated by the visual explanation method as a visual explanation and is used to compute either the visual explanation loss with ground-truth annotations or the distribution consistency loss with the aggregated set of ground-truth visual explanations. In (b), the visual representations of the image and its saliency-based explanation, encoded by a vision encoder, are projected and input into an LLM to generate a textual explanation, supervised by an autoregressive loss. The text in red (corresponding to the red regions in the saliency map) showcases how visual cues derived from the saliency map are integrated into the process of generating textual explanations.
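The choice described for panel (c) of Figure 2, between a visual explanation loss when a ground-truth annotation exists and a distribution consistency loss against the aggregated ground-truth explanations otherwise, can be sketched roughly as follows. The caption does not give the loss formulas, so this sketch assumes a mean absolute error for the annotated case and a KL divergence for the unannotated case; all function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def normalize(saliency):
    """Clip negatives and scale a map so its entries sum to 1."""
    s = np.clip(np.asarray(saliency, dtype=float), 0.0, None)
    total = s.sum()
    # Fall back to a uniform map if the input is all zeros.
    return s / total if total > 0 else np.full(s.shape, 1.0 / s.size)


def aggregate_annotations(annotations):
    """Average the available ground-truth maps into one dataset-level map.

    Assumption: "aggregated set" is read here as a simple pixel-wise mean.
    """
    return np.mean(np.stack(annotations), axis=0)


def visual_explanation_loss(saliency, annotation):
    """Assumed form: mean absolute error against the ground-truth map."""
    diff = np.asarray(saliency, dtype=float) - np.asarray(annotation, dtype=float)
    return float(np.mean(np.abs(diff)))


def distribution_consistency_loss(saliency, aggregated, eps=1e-8):
    """Assumed form: KL divergence from the aggregated map to the saliency map."""
    p = normalize(aggregated)
    q = normalize(saliency)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def explanation_loss(saliency, annotation, aggregated):
    """Use the supervised loss when an annotation exists, else the consistency loss."""
    if annotation is not None:
        return visual_explanation_loss(saliency, annotation)
    return distribution_consistency_loss(saliency, aggregated)


# Toy 2x2 example: one sample with an annotation, one without.
saliency = np.array([[0.0, 1.0], [1.0, 0.0]])
aggregated = aggregate_annotations(
    [np.array([[0, 1], [1, 0]]), np.array([[0, 1], [0, 0]])]
)
supervised = explanation_loss(saliency, saliency, aggregated)
unsupervised = explanation_loss(saliency, None, aggregated)
```

In this toy example the supervised term is zero because the saliency map matches its annotation exactly, while the consistency term is positive because the map differs from the dataset-level average. In an actual training loop these terms would be computed on differentiable tensors and combined with the prediction and autoregressive text losses mentioned in the caption.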