An Integrated Framework for Multi-Granular Explanation of Video Summarization
Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris
TL;DR
This work addresses the lack of interpretable explanations for video summarization by introducing an integrated multi-granular framework that produces visual explanations at both the fragment level and the object level. It combines a model-agnostic fragment-level approach (LIME adapted to video fragments) with a novel object-level method that pairs Video K-Net panoptic segmentation with perturbation-based explanation. The authors evaluate the framework on SumMe and TVSum using CA-SUM as the summarizer, reporting quantitative results and qualitative examples that reveal which fragments and objects most influence the summarizer. The framework is of practical use to media editors and in retrieval tasks, and it lays the groundwork for future extensions, including textual descriptions via vision-language models.
Abstract
In this paper, we propose an integrated framework for multi-granular explanation of video summarization. This framework integrates methods for producing explanations both at the fragment level (indicating which video fragments most influenced the decisions of the summarizer) and at the more fine-grained visual object level (highlighting which visual objects were the most influential for the summarizer). To build this framework, we extend our previous work in this field by investigating the use of a model-agnostic, perturbation-based approach for fragment-level explanation of the video summarization results, and by introducing a new method that combines the results of video panoptic segmentation with an adaptation of a perturbation-based explanation approach to produce object-level explanations. The performance of the developed framework is evaluated using a state-of-the-art summarization method and two benchmark datasets for video summarization. The findings of the quantitative and qualitative evaluations demonstrate the ability of our framework to spot the most and least influential fragments and visual objects of the video for the summarizer, and to provide a comprehensive set of visual explanations of the output of the summarization process.
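To make the idea of perturbation-based, fragment-level explanation concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a hypothetical `summarizer` callable that maps per-frame features to a single scalar score, masks fragments by zeroing their features, and fits an unweighted linear surrogate with ordinary least squares; the names `explain_fragments`, `frame_features`, and `fragments` are illustrative assumptions. The original LIME formulation additionally weights the perturbed samples by their proximity to the unperturbed input, which is omitted here for brevity.

```python
import numpy as np


def explain_fragments(summarizer, frame_features, fragments,
                      num_samples=200, seed=0):
    """LIME-style sketch: estimate one influence weight per video fragment.

    summarizer     -- hypothetical callable: (T, D) array -> scalar score
    frame_features -- (T, D) array of per-frame features
    fragments      -- list of (start, end) frame indices, end exclusive
    """
    rng = np.random.default_rng(seed)
    n = len(fragments)

    # Each row is a binary mask: 1 keeps a fragment, 0 masks it out.
    masks = rng.integers(0, 2, size=(num_samples, n))
    masks[0] = 1  # always include the unperturbed video

    outputs = np.empty(num_samples)
    for i, mask in enumerate(masks):
        perturbed = frame_features.copy()
        for keep, (start, end) in zip(mask, fragments):
            if not keep:
                perturbed[start:end] = 0.0  # assumption: masking = zeroing
        outputs[i] = summarizer(perturbed)

    # Linear surrogate: outputs ~ masks @ weights + intercept.
    design = np.hstack([masks, np.ones((num_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:n]  # drop the intercept


if __name__ == "__main__":
    # Toy stand-in for a summarizer: only the second fragment matters much.
    features = np.ones((6, 2))
    frags = [(0, 2), (2, 4), (4, 6)]
    toy = lambda x: 1.0 * x[0:2].sum() + 3.0 * x[2:4].sum()
    weights = explain_fragments(toy, features, frags)
    print("most influential fragment:", int(np.argmax(weights)))
```

Under these assumptions, a larger weight indicates a fragment whose presence raises the summarizer's score more. The same scheme carries over to the object level if the masked units are object segments (for example, those produced by a panoptic segmentation model) rather than temporal fragments.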
