Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

Lakshita Agarwal, Bindu Verma

TL;DR

The paper tackles video description generation by integrating high-level visual features with a transformer-based language model to produce fluent, context-aware narratives. It proposes a ResNet-50 feature extractor coupled with a GPT-2 encoder-decoder, enhanced by multi-head self-attention and cross-attention to align visual patches with textual output, trained on MSVD and BDD-X with gradient accumulation and mixed precision. Empirical results show the approach achieves strong performance across BLEU-4, CIDEr, METEOR, and ROUGE-L and outperforms several state-of-the-art methods, while ablation studies highlight the importance of visual features and attention mechanisms. The work advances explainable AI in video understanding by delivering interpretable, high-quality descriptions suitable for applications in autonomous systems, surveillance, and robotics, and suggests future improvements in cross-modal attention and domain adaptation.
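The summary mentions training with gradient accumulation. As a rough illustration of that general technique (not the authors' training code), the sketch below uses a toy one-parameter regression in plain Python. It shows that averaging gradients over equally sized micro-batches gives the same gradient as one large batch. The data, the weight `w`, and the helper `mse_grad` are all hypothetical and chosen only for this example.

```python
import math

def mse_grad(w, batch):
    """Gradient w.r.t. w of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy (x, y) pairs; purely illustrative, not from MSVD or BDD-X.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
w = 0.5                      # hypothetical current weight
micro_batch_size = 2         # what fits in memory at once (assumed)
accum_steps = len(data) // micro_batch_size

# Reference: gradient computed on the full batch in one pass.
full_grad = mse_grad(w, data)

# Gradient accumulation: process one micro-batch at a time, scale each
# micro-batch gradient by 1 / accum_steps, and sum before the optimizer step.
accumulated_grad = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated_grad += mse_grad(w, micro_batch) / accum_steps

print(math.isclose(full_grad, accumulated_grad, rel_tol=1e-9))
```

The equivalence holds exactly only when every micro-batch has the same size. In a deep learning framework the same idea is usually expressed by dividing the loss by the number of accumulation steps and delaying the optimizer update.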

Abstract

Understanding and analyzing actions in video is essential for producing insightful, contextualized descriptions, especially for video-based applications such as intelligent monitoring and autonomous systems. This work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The architecture uses ResNet50 to extract visual features from video frames taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual features are converted into patch embeddings and passed through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). To align textual and visual representations and support high-quality description generation, the system uses multi-head self-attention and cross-attention. Performance is evaluated using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The framework outperforms traditional methods, with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.
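To make the cross-attention step concrete, here is a minimal single-head NumPy sketch of text-token queries attending over visual patch embeddings. It is an illustration under stated assumptions, not the paper's implementation. The 7x7 grid of 2048-dimensional patches corresponds to a standard ResNet-50 final feature map for a 224x224 input, and 768 is the hidden size of the smallest GPT-2. The random projection matrices, the attention width `d_attn`, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(text_states, patch_embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Queries come from the text side; keys and values come from the
    visual patches, so each token gets a weighted mix of patch features.
    """
    q = text_states @ w_q                      # (num_tokens, d_attn)
    k = patch_embeddings @ w_k                 # (num_patches, d_attn)
    v = patch_embeddings @ w_v                 # (num_patches, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_tokens, num_patches)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, num_patches = 5, 49                # 49 = 7 x 7 patch grid (assumed)
d_text, d_visual, d_attn = 768, 2048, 64       # assumed sizes, see lead-in

# Random stand-ins for decoder hidden states and ResNet-50 patch features.
text_states = rng.normal(size=(num_tokens, d_text))
patch_embeddings = rng.normal(size=(num_patches, d_visual))

# Hypothetical projections; in a trained model these are learned parameters.
w_q = rng.normal(size=(d_text, d_attn)) / np.sqrt(d_text)
w_k = rng.normal(size=(d_visual, d_attn)) / np.sqrt(d_visual)
w_v = rng.normal(size=(d_visual, d_attn)) / np.sqrt(d_visual)

context, weights = cross_attention(text_states, patch_embeddings, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Each row of `weights` is a distribution over the 49 patches for one token. Attention maps of this kind are one common way to inspect which image regions influenced a generated word. A multi-head variant would run several such projections in parallel and concatenate the results.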

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Framework for the Proposed Model (ResNet50-GPT2): The system incorporates ResNet50 for image feature extraction and a GPT-2 encoder-decoder model to generate context-aware video-based image descriptions.
  • Figure 2: Demonstration of Ablation Study for the Proposed Work
  • Figure 3: Graphical Representation of the Results Obtained
  • Figure 4: Graphical Representation of the SOTA Methods