Video Summarization using Denoising Diffusion Probabilistic Model

Zirui Shang, Yubo Zhu, Hongxi Li, Shuo Yang, Xinxiao Wu

TL;DR

Addressing annotation noise in video summarization, the paper proposes a diffusion-based framework that learns the distribution of summaries via a forward diffusion and a learned reverse denoising with a noise predictor $\epsilon_\theta$. It uses video frame features as guidance and scales ground-truth scores to $[-1,1]$ for training with objective $L=\|\epsilon-\hat{\epsilon}\|^2$. It also integrates an unsupervised summarization model to bootstrap training under data scarcity, enabling effective testing with a reduced number of denoising steps. Empirical results on TVSum, SumMe, and FPVSum show strong performance and improved generalization compared to discriminative baselines.
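The training objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the standard DDPM closed-form forward process $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with a linear $\beta$ schedule, and ground-truth scores that already lie in $[0,1]$ before scaling. The function names, the schedule values, and the zero-output `dummy_noise_predictor` are hypothetical stand-ins for $\epsilon_\theta$.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed; the page does not give the schedule)."""
    span = max(num_steps - 1, 1)
    return [beta_start + (beta_end - beta_start) * i / span for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def scale_scores(scores, lo=0.0, hi=1.0):
    """Map importance scores from [lo, hi] to [-1, 1]."""
    return [2.0 * (s - lo) / (hi - lo) - 1.0 for s in scores]


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from x_0 in one shot and return the noise that was added."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    xt = [a * x + b * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_loss(eps, eps_hat):
    """Squared L2 distance between true and predicted noise: L = ||eps - eps_hat||^2."""
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))


def dummy_noise_predictor(xt, features, t):
    """Hypothetical placeholder for eps_theta(x_t, f, t); always predicts zero noise."""
    return [0.0 for _ in xt]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

scores = [0.0, 0.25, 0.5, 1.0]            # toy per-frame ground-truth scores
features = [[0.0] * 4 for _ in scores]    # toy frame features used as guidance
x0 = scale_scores(scores)                 # scores scaled to [-1, 1]

t = rng.randrange(len(betas))             # random diffusion step for this sample
xt, eps = forward_diffuse(x0, alpha_bars[t], rng)
loss = noise_loss(eps, dummy_noise_predictor(xt, features, t))
```

In an actual training loop the placeholder would be replaced by a learnable network, and `loss` would be minimized over many videos and randomly drawn steps `t`.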

Abstract

Video summarization aims to eliminate visual redundancy while retaining key parts of video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators when annotating the same video. In this paper, we introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction, and generates summaries by iterative denoising. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability. Moreover, to facilitate training DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising process. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.
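The idea of generating a summary by iterative denoising, with an unsupervised model covering the earlier part of the denoising process, might look like the sketch below. It is one plausible reading rather than the paper's algorithm. It assumes the standard DDPM ancestral sampling update with $\sigma_t^2=\beta_t$, assumes the unsupervised model's scores lie in $[0,1]$, and assumes they are noised to an intermediate step `start_step` before denoising begins. All names (`make_schedule`, `reverse_step`, `denoise_from`, `zero_predictor`, `unsup_scores`) are hypothetical.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its cumulative products (assumed values)."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * i / span for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def reverse_step(xt, eps_hat, t, betas, alpha_bars, rng):
    """One standard DDPM sampling step from (zero-based) index t to t - 1."""
    beta_t = betas[t]
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t) if t > 0 else 0.0  # no noise is added on the final step
    return [scale * (x - coef * e) + sigma * rng.gauss(0.0, 1.0)
            for x, e in zip(xt, eps_hat)]


def denoise_from(x_start, features, start_step, predictor, betas, alpha_bars, rng):
    """Denoise from start_step down to 0, then map the result from [-1, 1] to [0, 1]."""
    x = list(x_start)
    for t in range(start_step, -1, -1):
        x = reverse_step(x, predictor(x, features, t), t, betas, alpha_bars, rng)
    return [(min(1.0, max(-1.0, v)) + 1.0) / 2.0 for v in x]


def zero_predictor(xt, features, t):
    """Hypothetical placeholder for the trained noise predictor."""
    return [0.0 for _ in xt]


rng = random.Random(0)
betas, alpha_bars = make_schedule(1000)

unsup_scores = [0.1, 0.8, 0.4]                  # toy output of an unsupervised summarizer
features = [[0.0] * 4 for _ in unsup_scores]    # toy frame features used as guidance
start_step = 99                                 # far fewer steps than the full chain

# Assumption: noise the scaled unsupervised scores up to start_step, then denoise.
a = alpha_bars[start_step]
x_start = [math.sqrt(a) * (2.0 * s - 1.0) + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
           for s in unsup_scores]
scores = denoise_from(x_start, features, start_step, zero_predictor,
                      betas, alpha_bars, rng)
```

Starting at `start_step` instead of the last step of the chain is what shortens testing in this sketch: the loop runs `start_step + 1` iterations rather than the full schedule length.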

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: An example of subjective annotation noise in the TVSum dataset, where yellow blocks represent the annotated video frames of summaries by different annotators.
  • Figure 2: The framework of our method, where the training process shows how the noise predictor network learns to predict noise components, and the testing process shows how to generate accurate importance scores through denoising.
  • Figure 3: The structure of the noise predictor network, which uses video frame features $f$ as guidance and noised importance scores $x_t$ as input to predict the noise component at step $t$.
  • Figure 4: Results (F-score) of experiments with different values of the hyper-parameter $t$ on the TVSum and SumMe datasets.
  • Figure 5: Qualitative results of different video summarization methods. The line segments denote the selected segments and the frames are shown below.
  • ...and 1 more figures