Reinforcement Learning for Unsupervised Video Summarization with Reward Generator Training

Mehryar Abbasi, Hadi Hadizadeh, Parvaneh Saeedi

TL;DR

TR-SUM tackles unsupervised video summarization by replacing adversarial training with a two-stage, generator-guided reinforcement learning pipeline. A self-supervised generator provides a reconstruction-based reward, guiding a transformer-based summarizer to select frames that maximize reconstruction fidelity, with a per-video baseline for stability. The approach yields state-of-the-art F-scores on SumMe and TVSum (SumMe: 54.5, TVSum: 62.3) and demonstrates superior training stability and efficiency compared with GAN-based methods. These results suggest reconstruction fidelity is a strong proxy for informativeness and that the proposed generator-based reward can robustly align automatic summaries with human judgments.
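
As a rough illustration of how a per-video baseline enters a policy-gradient update, the standard REINFORCE estimator with a baseline can be written as below. The symbols are assumptions introduced here, not the paper's notation: $\pi_\theta$ is the summarizer's frame-selection policy, $a_n$ is the $n$-th sampled frame selection for a video $x$, $R_n$ is its reconstruction-based reward, and $b_x$ is a baseline kept separately for each video (for example, a running mean of that video's past rewards). The paper's exact formulation may differ.

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \left( R_n - b_x \right) \nabla_\theta \log \pi_\theta(a_n \mid x)
$$

Subtracting $b_x$ leaves the expected gradient unchanged as long as $b_x$ does not depend on the sampled selection $a_n$, while reducing its variance, which is the usual motivation for a baseline.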

Abstract

This paper presents a novel approach for unsupervised video summarization using reinforcement learning (RL), addressing limitations of prior work such as unstable adversarial training and reliance on heuristic-based reward functions. The method operates on the principle that reconstruction fidelity serves as a proxy for informativeness: the better the full video can be reconstructed from a summary, the higher the quality of that summary. The summarizer model assigns importance scores to frames to generate the final summary. For training, RL is coupled with a reward generation pipeline that incentivizes improved reconstructions. This pipeline uses a generator model to reconstruct the full video from the selected summary frames; the similarity between the original and reconstructed video provides the reward signal. The generator itself is pre-trained in a self-supervised manner to reconstruct randomly masked frames. This two-stage training process enhances stability compared to adversarial architectures. Experimental results show strong alignment with human judgments and promising F-scores, validating the reconstruction objective.
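
The reward loop described in the abstract can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for the pre-trained generator (it simply copies the nearest kept frame), cosine similarity is an assumed choice of similarity measure, and frame selection is modeled as independent Bernoulli draws from the frame scores. The sketch also omits any summary-length constraint, without which keeping every frame trivially maximizes the reward.

```python
import numpy as np

def cosine_similarity_reward(original, reconstructed, eps=1e-8):
    """Mean per-frame cosine similarity between two (T, D) embedding arrays."""
    num = np.sum(original * reconstructed, axis=1)
    den = (np.linalg.norm(original, axis=1)
           * np.linalg.norm(reconstructed, axis=1) + eps)
    return float(np.mean(num / den))

def toy_generator(embeddings, keep_mask):
    """Hypothetical stand-in for the pre-trained generator.

    Every frame is replaced by the nearest (in time) kept frame.
    A real generator would be a learned model.
    """
    kept = np.flatnonzero(keep_mask)
    if kept.size == 0:
        return np.zeros_like(embeddings)
    idx = np.arange(len(embeddings))
    nearest = kept[np.argmin(np.abs(idx[:, None] - kept[None, :]), axis=1)]
    return embeddings[nearest]

def reinforce_step(embeddings, scores, rng, baseline):
    """Sample one summary and return (reward, gradient estimate w.r.t. scores)."""
    p = np.clip(scores, 1e-6, 1 - 1e-6)       # avoid division by zero
    keep_mask = rng.random(len(p)) < p        # assumed Bernoulli frame selection
    reconstructed = toy_generator(embeddings, keep_mask)
    reward = cosine_similarity_reward(embeddings, reconstructed)
    a = keep_mask.astype(float)
    # d/dp of log Bernoulli(a; p) is a/p - (1 - a)/(1 - p)
    grad_log_prob = a / p - (1 - a) / (1 - p)
    return reward, (reward - baseline) * grad_log_prob
```

In this toy, a training loop would call `reinforce_step` repeatedly for each video, update that video's baseline from the observed rewards, and move the scores (or the parameters of the model producing them) along the returned gradient estimate.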

Paper Structure

This paper contains 22 sections, 11 equations, 12 figures, 8 tables, 2 algorithms.

Figures (12)

  • Figure 1: System flowchart: A) The input video is embedded, frames are labeled by shot, and the sequence is broken down into segments. B) Segments are randomly masked based on the shot labels for self-supervised generator training. C) The trained generator is used to train the summarizer via reinforcement learning. D) The summarizer assigns scores to the embedding sub-sequences, which are combined to create a frame score sequence for the generated video summary.
  • Figure 2: The architectures of A) the generator and B) the summarizer.
  • Figure 3: Simulating the effect of different $\delta$ on each fold's F-score using Gaussian-sampled frame scores with varying median centers representing $\delta$.
  • Figure 4: Comparison of training loss curves for (a) SUM-GAN-AAE [apostolidis2020unsupervised], (b) AC-SUM-GAN [apostolidis2020ac], and (c) TR-SUM over 100 epochs on the TVSum dataset.
  • Figure 5: Effect of $L$ on F-score, $\rho$, and $\tau$.
  • ...and 7 more figures