An Experimental Study on Generating Plausible Textual Explanations for Video Summarization

Thomas Eleftheriadis, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

The paper tackles the challenge of generating plausible textual explanations for video summarization by extending an existing multi-granular explainable framework with LLaVA-OneVision to describe fragment-level visual explanations in natural language. Plausibility is evaluated by measuring semantic overlap between the textual description of the explanation and the textual description of the video summary, using SBERT and SimCSE embeddings, alongside a faithfulness metric (Disc+) based on top-k fragment influence. Experiments on SumMe and TVSum with the CA-SUM method reveal that, for condensed explanations, faithfulness and plausibility can diverge, while for more descriptive explanations (three fragments), faithful explanations tend to be more plausible, especially when using a per-fragment description followed by summarization (Approach 2). The work demonstrates that plausible textual explanations can be generated automatically without human interpretation, providing a practical path toward more transparent multimodal video understanding and guiding design choices for textual explanations in explainable AI for video data.

Abstract

In this paper, we present an experimental study on generating plausible textual explanations for the outcomes of video summarization. For this study, we extend an existing framework for multigranular explanation of video summarization by integrating a state-of-the-art (SOTA) Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. We then focus on one of the most desired characteristics of explainable AI: the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, using two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and to identify the most appropriate approach for generating plausible textual explanations for video summarization.
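
The plausibility evaluation described above can be illustrated with a minimal sketch. It assumes that "semantic overlap" is measured as the cosine similarity between the sentence embedding of the explanation's description and that of the summary's description; the abstract does not state the exact similarity function, so this choice is an assumption. The function names (`cosine_similarity`, `plausibility_scores`) and the toy vectors are hypothetical; in practice the embeddings would come from an SBERT or SimCSE encoder.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero embedding carries no direction; treat as no overlap.
        return 0.0
    return dot / (norm_a * norm_b)


def plausibility_scores(
    explanation_embs: Dict[str, Sequence[float]],
    summary_embs: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """One score per embedding method (e.g. "SBERT", "SimCSE").

    Assumption: plausibility is the cosine similarity between the embedding
    of the explanation's description and that of the summary's description.
    """
    return {
        method: cosine_similarity(explanation_embs[method], summary_embs[method])
        for method in explanation_embs
    }


# Toy 3-dimensional vectors standing in for real sentence embeddings.
explanation = {"SBERT": [1.0, 0.0, 0.0], "SimCSE": [1.0, 1.0, 0.0]}
summary = {"SBERT": [1.0, 0.0, 0.0], "SimCSE": [0.0, 0.0, 1.0]}
scores = plausibility_scores(explanation, summary)
```

Under this assumption, a score near 1 means the explanation's description is semantically close to the summary's description, while a score near 0 means the two texts share little meaning.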

Paper Structure

This paper contains 9 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A high-level overview of the extended framework for producing textual explanations for the video summarization results.
  • Figure 2: An overview of the proposed approach for evaluating the plausibility of visual explanations.
  • Figure 3: Top: keyframe-based representation of the content of a TVSum video, titled "Reuben Sandwich with Corned Beef & Sauerkraut". Middle: keyframe-based representations of the video summary and of the fragment-level explanation produced by the attention-based explanation method. Bottom: textual descriptions of the video summary and the fragment-level explanation, obtained from the integrated LLaVA-OneVision model, together with the computed plausibility scores.
  • Figure 4: Top: keyframe-based representation of the content of a TVSum video, titled "Beekeeper". Middle: keyframe-based representations of the video summary and of the fragment-level explanation produced by the attention-based explanation method. Bottom: textual descriptions of the video summary and the fragment-level explanation, obtained from the integrated LLaVA-OneVision model, together with the computed plausibility scores.