An Integrated Framework for Multi-Granular Explanation of Video Summarization

Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

This work addresses the lack of interpretable explanations for video summarization by introducing an integrated multi-granular framework that provides fragment-level and object-level visual explanations. It combines a model-agnostic fragment-level approach (adapted LIME on video fragments) with a novel object-level method using Video K-Net panoptic segmentation and perturbation-based explanations, all within a single framework. The authors validate the method on SumMe and TVSum using CA-SUM as the summarizer, reporting both quantitative metrics and qualitative demonstrations that reveal which fragments and objects most influence the summarizer. The framework has practical impact for media editors and retrieval tasks and lays the groundwork for future extensions, including textual descriptions via vision-language models.

Abstract

In this paper, we propose an integrated framework for multi-granular explanation of video summarization. This framework integrates methods for producing explanations both at the fragment level (indicating which video fragments most influenced the decisions of the summarizer) and at the more fine-grained visual object level (highlighting which visual objects were the most influential for the summarizer). To build this framework, we extend our previous work in this field by investigating the use of a model-agnostic, perturbation-based approach for fragment-level explanation of the video summarization results, and by introducing a new method that combines the results of video panoptic segmentation with an adaptation of a perturbation-based explanation approach to produce object-level explanations. The performance of the developed framework is evaluated using a state-of-the-art summarization method and two benchmark datasets for video summarization. The findings of the quantitative and qualitative evaluations demonstrate the ability of our framework to spot the most and least influential fragments and visual objects of the video for the summarizer, and to provide a comprehensive set of visual-based explanations of the output of the summarization process.
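To make the idea of a model-agnostic, perturbation-based fragment-level explanation more concrete, the following is a minimal LIME-style sketch, not the authors' implementation. It assumes that a fragment is perturbed by zeroing its frame features, that the summarizer's response is reduced to the mean frame importance score, and that a proximity-weighted linear surrogate is fitted on the binary fragment masks. The names `explain_fragments` and `toy_summarizer`, the `kernel_width` value, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def explain_fragments(summarizer, frames, fragments,
                      n_samples=500, kernel_width=0.5, seed=0):
    """LIME-style sketch: one influence score per video fragment.

    summarizer: callable mapping a (n_frames, dim) array to frame scores
                (treated as a black box).
    fragments:  list of (start, end) frame index pairs, end exclusive.
    Masking by zeroing features and using the mean score as the response
    are assumptions of this sketch, not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_frag = len(fragments)

    # Random binary masks: 1 keeps a fragment, 0 masks it out.
    masks = rng.integers(0, 2, size=(n_samples, n_frag))
    masks[0] = 1  # include the unperturbed video

    responses = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = frames.copy()
        for keep, (start, end) in zip(mask, fragments):
            if not keep:
                perturbed[start:end] = 0.0
        responses[i] = summarizer(perturbed).mean()

    # Proximity weights: samples closer to the original video count more.
    distance = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)

    # Weighted least squares for the linear surrogate (with intercept).
    design = np.hstack([masks, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], responses * sw, rcond=None)
    return coef[:-1]  # surrogate coefficients = fragment influence scores


# Toy usage: 12 one-dimensional "frames" split into 3 fragments.
def toy_summarizer(features):
    # Hypothetical stand-in for a trained summarizer.
    return features[:, 0]

frames = np.repeat([0.1, 1.0, 0.2], 4).reshape(-1, 1)
fragments = [(0, 4), (4, 8), (8, 12)]
scores = explain_fragments(toy_summarizer, frames, fragments)
ranking = np.argsort(scores)[::-1]  # most influential fragment first
```

In this toy setup the response is exactly linear in the masks, so the surrogate recovers each fragment's contribution and the middle fragment ranks first. With a real summarizer the surrogate is only a local approximation, and the choice of masking strategy and response function would need to follow the method described in the paper.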

Paper Structure

This paper contains 11 sections, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: High-level overview of our framework for explaining video summarization. This framework produces: i) a fragment-level explanation indicating the most influential video fragments; ii) an object-level explanation (#1) highlighting the most influential objects within the most influential fragments; and iii) an object-level explanation (#2) highlighting the visual objects, within the fragments selected for inclusion in the summary, that most influenced this selection.
  • Figure 2: The proposed processing pipeline for producing fragment-level explanations.
  • Figure 3: Processing pipeline for producing object-level explanations. The selected video fragments are either the most influential ones according to the fragment-level explanation, or the top-scoring ones according to the summarizer.
  • Figure 4: The computed Disc+ and Disc- scores for the examined fragment-level explanation methods on the videos of the SumMe dataset, after masking out the three top- and bottom-scoring fragments in a one-by-one and sequential manner.
  • Figure 5: The computed Disc+ and Disc- scores for the examined fragment-level explanation methods on the videos of the TVSum dataset, after masking out the three top- and bottom-scoring fragments in a one-by-one and sequential manner.
  • ...and 2 more figures