Unsupervised Video Summarization via Iterative Training and Simplified GAN

Hanqing Li, Diego Klabjan, Jean Utke

TL;DR

This work addresses unsupervised video summarization by introducing SUM-SR, a discriminator-free model that pairs a frame selector with a reconstructor and trains them with a reconstruction loss $L_{recon}$ and a sparsity term $L_{spar}$. It adds a trainable mask, trains the model part by part over multiple iterations, and uses an unsupervised model-selection framework to pick the best model without ground-truth labels. Across SumMe, TVSum, and four new datasets, SUM-SR, especially its 5-iteration variant, outperforms state-of-the-art unsupervised methods by up to about 9% on average, and removing the discriminator reduces training time and model size. The approach demonstrates the viability of discriminator-free, iterative training for video summarization and provides practical guidance for applying the method to longer videos via sampling or shot-based processing.

Abstract

This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.
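
To make the training regime described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that $L_{recon}$ is a mean squared error between original and reconstructed frame features, and that $L_{spar}$ penalizes the gap between the mean frame score and a target ratio $\sigma$; neither form is given on this page. The function names, the reconstructor-first ordering within an iteration, and the `mask_trainable` flag are likewise hypothetical and chosen only for illustration.

```python
from statistics import fmean


def reconstruction_loss(frames, reconstructed):
    """Assumed form of L_recon: mean squared error over all feature values."""
    squared_errors = [
        (a - b) ** 2
        for frame, recon in zip(frames, reconstructed)
        for a, b in zip(frame, recon)
    ]
    return fmean(squared_errors)


def sparsity_loss(scores, sigma):
    """Assumed form of L_spar: distance of the mean frame score from sigma."""
    return abs(fmean(scores) - sigma)


def training_schedule(num_iterations):
    """List the alternating training steps as (iteration, part, mask_trainable).

    Each iteration trains one part at a time: first the reconstructor, then
    the frame selector (ordering assumed). The mask vector is marked trainable
    only during the first iteration.
    """
    schedule = []
    for iteration in range(1, num_iterations + 1):
        mask_trainable = iteration == 1
        schedule.append((iteration, "reconstructor", mask_trainable))
        schedule.append((iteration, "selector", mask_trainable))
    return schedule


if __name__ == "__main__":
    # Toy data: two frames with two features each, and their frame scores.
    frames = [[1.0, 2.0], [3.0, 4.0]]
    reconstructed = [[1.0, 4.0], [3.0, 4.0]]
    print(reconstruction_loss(frames, reconstructed))
    print(sparsity_loss([1.0, 1.0], 0.5))
    for step in training_schedule(2):
        print(step)
```

In a real model each scheduled step would update only the named part with a gradient-based optimizer while the other part stays frozen; the sketch records only the order of those steps.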

Paper Structure (17 sections, 4 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: The proposed SUM-SR architecture.
  • Figure 2: The training steps of SUM-SR. During training, an iteration includes one reconstruction and one selection. We train the mask vector only in the first iteration (if there are multiple iterations).
  • Figure 3: Comparison (standard deviation of the F-score) of different methods running multiple times with different random seeds on six datasets.
  • Figure 4: Comparison of per epoch training time (sec/epoch), total training time (seconds) and number of parameters (millions) of different methods in the same computing environment.
  • Figure 5: Comparison (F-score ($\%$)) of different $\sigma$ in SUM-SR$_{sepMa}$ on the TVSum dataset with both unsupervised and supervised (best) model selection methods.
  • ...and 4 more figures