Unsupervised Video Summarization via Iterative Training and Simplified GAN

Hanqing Li; Diego Klabjan; Jean Utke

TL;DR

This work addresses unsupervised video summarization by introducing SUM-SR, a discriminator-free model that pairs a frame selector with a reconstructor and trains them with a reconstruction-based objective $L_{recon}$ together with a sparsity term $L_{spar}$. It adds a trainable mask, trains the model part by part over multiple iterations, and includes an unsupervised model-selection framework that picks the best model without ground truth. Across SumMe, TVSum, and four new datasets, SUM-SR, especially in its 5-iteration variant, outperforms state-of-the-art unsupervised methods by up to about 9% on average, while removing the discriminator reduces training time and model size. The approach demonstrates the viability of discriminator-free, iterative training for video summarization and provides practical guidance for applying the method to longer videos via sampling or shot-based processing.

Abstract

This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.
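
The alternating schedule described in the abstract can be illustrated with a toy sketch. This is not the SUM-SR architecture: the scalar "frames", the single-parameter reconstructor `gain`, the squared sparsity penalty, and the names `sigma`, `lam`, `steps`, and `lr` are all assumptions made for illustration, and the trainable mask is omitted. The sketch only shows the idea of freezing one part while training the other, repeated for several iterations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(frames, logits, gain, sigma, lam):
    # Toy objective: reconstruction error plus a sparsity penalty that
    # pulls the mean selection score toward sigma (assumed form).
    scores = [sigmoid(a) for a in logits]
    n = len(frames)
    recon = sum((gain * s * x - x) ** 2 for s, x in zip(scores, frames)) / n
    spar = (sum(scores) / n - sigma) ** 2
    return recon + lam * spar

def train_reconstructor(frames, logits, gain, steps, lr):
    # Selector is frozen: scores are computed once and never updated here.
    scores = [sigmoid(a) for a in logits]
    n = len(frames)
    for _ in range(steps):
        grad = sum(2 * (gain * s * x - x) * s * x
                   for s, x in zip(scores, frames)) / n
        gain -= lr * grad
    return gain

def train_selector(frames, logits, gain, sigma, lam, steps, lr):
    # Reconstructor is frozen: only the selector logits change.
    logits = list(logits)
    n = len(frames)
    for _ in range(steps):
        scores = [sigmoid(a) for a in logits]
        mean_s = sum(scores) / n
        for i, (s, x) in enumerate(zip(scores, frames)):
            d_recon = 2 * (gain * s * x - x) * gain * x / n
            d_spar = lam * 2 * (mean_s - sigma) / n
            logits[i] -= lr * (d_recon + d_spar) * s * (1 - s)
    return logits

def iterative_training(frames, iterations=5, sigma=0.5, lam=1.0,
                       steps=50, lr=0.1):
    logits = [0.0] * len(frames)   # selector parameters
    gain = 1.0                     # reconstructor parameter
    history = [loss(frames, logits, gain, sigma, lam)]
    for _ in range(iterations):
        # One iteration = one reconstruction phase, then one selection phase.
        gain = train_reconstructor(frames, logits, gain, steps, lr)
        logits = train_selector(frames, logits, gain, sigma, lam, steps, lr)
        history.append(loss(frames, logits, gain, sigma, lam))
    return logits, gain, history

if __name__ == "__main__":
    _, _, hist = iterative_training([1.0, 0.2, 0.9, 0.1])
    print(hist)
```

Because each phase takes small gradient steps on one block of parameters while the other is held fixed, the toy loss should not increase from one iteration to the next; whether the same holds for the full model depends on details not given here.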

Paper Structure (17 sections, 4 equations, 9 figures, 3 tables)

Introduction
Related Work
Model
Model Structure
Training Strategy
Summarization
Model Selection
Experiments
Datasets
Evaluation
Implementation Details
Results
Ablation and Sensitivity Studies
Conclusion
Model Selection Method
...and 2 more sections

Figures (9)

Figure 1: The proposed SUM-SR architecture.
Figure 2: The training steps of SUM-SR. During training, an iteration includes one reconstruction and one selection. We train the mask vector only in the first iteration (if there are multiple iterations).
Figure 3: Comparison (standard deviation of the F-score) of different methods running multiple times with different random seeds on six datasets.
Figure 4: Comparison of per epoch training time (sec/epoch), total training time (seconds) and number of parameters (millions) of different methods in the same computing environment.
Figure 5: Comparison (F-score ($\%$)) of different $\sigma$ in SUM-SR$_{sepMa}$ on the TVSum dataset with both unsupervised and supervised (best) model selection methods.
...and 4 more figures
