Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence

Zahidul Islam, Sujoy Paul, Mrigank Rochan

TL;DR

This work tackles unsupervised video highlight detection without manual annotations by exploiting the recurrence of salient moments across videos of the same category in both audio and visual modalities. It constructs pseudo-categories via clustering of aggregated audio-visual features, then derives audio, visual, and audio-visual pseudo-highlights based on cross-video recurrence to supervise an audio-visual highlight detector that uses unimodal self-attention and bimodal cross-attention. The approach achieves state-of-the-art results among unsupervised methods and competitive performance with weakly supervised baselines across YouTube Highlights, TVSum, and QVHighlights, with extensive ablations demonstrating the benefits of audio cues, cross-modal fusion, and pseudo-categories. By reducing reliance on labeled data and highlighting the informative role of audio, the method offers a scalable pathway for robust, cross-modal video understanding in real-world settings.

Abstract

With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome this, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose a novel unsupervised approach which capitalizes on the premise that significant moments tend to recur across multiple videos of a similar category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we also compute visual pseudo-highlight scores for each video using visual features. Then, we combine audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlight of each video for training an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks showcase the superior performance of our method over prior work.
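The recurrence idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, whether a video's own clips are included in the comparison, or how the audio and visual scores are combined. The sketch assumes cosine similarity, averages over clips from other videos in the same pseudo-category, and fuses the two modalities with a weighted mean. The names `cosine`, `recurrence_scores`, `combine`, and `weight` are hypothetical, and the pseudo-category ids are taken as given rather than computed by clustering.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def recurrence_scores(videos, categories):
    """Score each clip by how strongly it recurs in its pseudo-category.

    videos:     list of videos, each a list of per-clip feature vectors
    categories: one pseudo-category id per video (assumed precomputed)
    returns:    one list of clip scores per video

    Assumption: a clip's score is its mean cosine similarity to the clips
    of all *other* videos that share the same pseudo-category.
    """
    scores = []
    for i, clips in enumerate(videos):
        # Collect clips from every other video in the same pseudo-category.
        peers = [
            clip
            for j, other in enumerate(videos)
            if j != i and categories[j] == categories[i]
            for clip in other
        ]
        scores.append([
            sum(cosine(clip, p) for p in peers) / len(peers) if peers else 0.0
            for clip in clips
        ])
    return scores


def combine(audio_scores, visual_scores, weight=0.5):
    """Hypothetical fusion: weighted mean of audio and visual clip scores."""
    return [
        [weight * a + (1.0 - weight) * v for a, v in zip(audio, visual)]
        for audio, visual in zip(audio_scores, visual_scores)
    ]
```

Running `recurrence_scores` once on audio features and once on visual features, then passing both results to `combine`, gives one per-clip pseudo-label sequence per video that could serve as a training target under these assumptions.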

Paper Structure (11 sections, 3 equations, 3 figures, 13 tables)

Figures (3)

  • Figure 1: a) Visual Recurrence: Highlights of skating videos mostly consist of jump tricks, which appear frequently in multiple videos (first and second rows). Similarly, in cooking videos, close-up shots of food depicting various actions, such as chopping and pan-frying, commonly appear as highlight moments (third and fourth rows). b) Audio Recurrence: In gymnastics videos, loud cheers and claps are recurring audio cues, which occur when the spectators react to interesting and highlight-worthy moves such as flips or cartwheels. Note that the highlight clips marked in green are also the annotated ground-truth highlights of the example videos in the benchmark datasets.
  • Figure 2: An overview of our unsupervised highlight detection framework. We first extract visual and audio features from each clip of a video. We then use a clustering technique to identify pseudo-categories of the videos. Next, we compare each clip of a video with all the clips across videos of the same pseudo-category using both audio and visual features to obtain the audio-visual pseudo-highlight (AV-PH) of the video. Using the audio-visual pseudo-highlight as supervision, we train our audio-visual highlight detection network to assign a highlight score to each video clip. We pick the high-scoring clips to obtain the highlight of the video.
  • Figure 3: Qualitative results. We show the highlight clips (in green) along with the predicted scores of our method (in blue) and the ground-truth highlight regions (in yellow). For the dog show video from the YouTube dataset (top), our method correctly picks clips with interesting acrobatic movements such as jumping over obstacles. For the sandwich-making video from TVSum (bottom), our method accurately detects the highlight moments, which mostly consist of close-up shots of food and important stages of cooking.
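
The final step in the Figure 2 caption, picking the high-scoring clips, can be sketched as a top-k selection over predicted scores. This is an illustrative assumption rather than the paper's procedure: the captions do not say how many clips are kept or how ties are handled, so the function name `select_highlight` and the parameter `k` are hypothetical.

```python
def select_highlight(scores, k):
    """Return the indices of the k highest-scoring clips, in temporal order.

    scores: predicted highlight score for each clip of one video
    k:      number of clips to keep (hypothetical; clamped to the video length)
    """
    k = max(0, min(k, len(scores)))
    # Rank clip indices by score, highest first; ties keep the earlier clip.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the highlight plays back chronologically.
    return sorted(ranked[:k])
```

For example, `select_highlight([0.1, 0.9, 0.3, 0.8], 2)` keeps the clips at indices 1 and 3.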