Reinforcement Learning for Unsupervised Video Summarization with Reward Generator Training

Mehryar Abbasi; Hadi Hadizadeh; Parvaneh Saeedi

TL;DR

TR-SUM tackles unsupervised video summarization by replacing adversarial training with a two-stage, generator-guided reinforcement learning pipeline. A self-supervised generator provides a reconstruction-based reward, guiding a transformer-based summarizer to select frames that maximize reconstruction fidelity, with a per-video baseline for stability. The approach yields state-of-the-art F-scores on SumMe and TVSum (SumMe: 54.5, TVSum: 62.3) and demonstrates superior training stability and efficiency compared with GAN-based methods. These results suggest reconstruction fidelity is a strong proxy for informativeness and that the proposed generator-based reward can robustly align automatic summaries with human judgments.

Abstract

This paper presents a novel approach for unsupervised video summarization using reinforcement learning (RL), addressing limitations such as unstable adversarial training and reliance on heuristic reward functions. The method rests on the principle that reconstruction fidelity serves as a proxy for informativeness: the better the full video can be reconstructed from a summary, the higher the summary's quality. The summarizer model assigns importance scores to frames, and these scores determine the final summary. For training, RL is coupled with a reward generation pipeline that incentivizes improved reconstructions. This pipeline uses a generator model to reconstruct the full video from the selected summary frames; the similarity between the original and reconstructed video provides the reward signal. The generator itself is pre-trained in a self-supervised manner to reconstruct randomly masked frames. This two-stage training process is more stable than adversarial architectures. Experimental results show strong alignment with human judgments and promising F-scores, validating the reconstruction objective.
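
The reward idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: `toy_generator` is a hypothetical stand-in that fills each dropped frame with its nearest kept frame (the paper uses a pre-trained generator network), cosine similarity is an assumed choice of similarity measure, and the baseline is assumed to be the mean reward over summaries sampled for the same video. All function names are illustrative.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def toy_generator(frames, keep):
    """Hypothetical stand-in for the trained generator:
    each frame is 'reconstructed' as a copy of its nearest kept frame."""
    kept = [i for i, k in enumerate(keep) if k]
    if not kept:
        # Nothing selected: reconstruct every frame as a zero vector.
        return [[0.0] * len(frames[0]) for _ in frames]
    return [frames[min(kept, key=lambda j: abs(j - i))] for i in range(len(frames))]


def reconstruction_reward(frames, keep):
    """Reward = mean similarity between original and reconstructed frames."""
    recon = toy_generator(frames, keep)
    return sum(cosine(f, r) for f, r in zip(frames, recon)) / len(frames)


def sample_summary(scores, rng):
    """Sample a keep/drop decision per frame from the summarizer's scores in [0, 1]."""
    return [rng.random() < p for p in scores]


def advantages_for_video(frames, scores, n_samples=8, seed=0):
    """Sample several summaries of one video and subtract a per-video baseline
    (assumed here to be the mean reward of those samples)."""
    rng = random.Random(seed)
    samples = [sample_summary(scores, rng) for _ in range(n_samples)]
    rewards = [reconstruction_reward(frames, s) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return samples, [r - baseline for r in rewards]


# Toy video: two visually distinct "shots" of two frames each.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
spread = reconstruction_reward(frames, [True, False, True, False])
clustered = reconstruction_reward(frames, [True, True, False, False])
samples, advantages = advantages_for_video(frames, [0.5, 0.5, 0.5, 0.5])
```

In this toy setting, a summary that keeps one frame from each shot earns a higher reward than one that keeps two frames from the same shot, which is the behavior the reconstruction objective is meant to encourage. In a policy-gradient update, each advantage would weight the log-probability of its sampled summary.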

Paper Structure (22 sections, 11 equations, 12 figures, 8 tables, 2 algorithms)

Introduction
Related Works
Approach
Encoding and video segmentation
Generator architecture and training
Summarizer's architecture and training
Inference and summary generation
Relative Comparison with Prior Works
Experimental Results
Datasets and the evaluation method
Implementation setup
Comparison against the state-of-the-art methods
Quantitative Performance Comparison
Computational Efficiency and Stability Analysis
Ablation Study
...and 7 more sections

Figures (12)

Figure 1: System flowchart: A) The input video is embedded, its frames are labeled by shot, and the sequence is split into segments. B) Segments are randomly masked based on the shot labels for self-supervised generator training. C) The trained generator is used to train the summarizer via reinforcement learning. D) The summarizer assigns scores to the embedding sub-sequences, which are combined into a frame score sequence used to generate the video summary.
Figure 2: The architectures of A) generator B) summarizer.
Figure 3: Simulating the effect of different $\delta$ on each fold's F-score using Gaussian-sampled frame scores with varying median centers representing $\delta$.
Figure 4: Comparison of training loss curves for (a) SUM-GAN-AAE [apostolidis2020unsupervised], (b) AC-SUM-GAN [apostolidis2020ac], and (c) TR-SUM over 100 epochs on the TVSum dataset.
Figure 5: Effect of $L$ on F-score, $\rho$, and $\tau$.
...and 7 more figures
