Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence
Zahidul Islam, Sujoy Paul, Mrigank Rochan
TL;DR
This work tackles unsupervised video highlight detection without manual annotations by exploiting the recurrence of salient moments across videos of the same category in both audio and visual modalities. It constructs pseudo-categories via clustering of aggregated audio-visual features, then derives audio, visual, and audio-visual pseudo-highlights based on cross-video recurrence to supervise an audio-visual highlight detector that uses unimodal self-attention and bimodal cross-attention. The approach achieves state-of-the-art results among unsupervised methods and competitive performance with weakly supervised baselines across YouTube Highlights, TVSum, and QVHighlights, with extensive ablations demonstrating the benefits of audio cues, cross-modal fusion, and pseudo-categories. By reducing reliance on labeled data and highlighting the informative role of audio, the method offers a scalable pathway for robust, cross-modal video understanding in real-world settings.
Abstract
With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome these limitations, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose a novel unsupervised approach which capitalizes on the premise that significant moments tend to recur across multiple videos of the same category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we compute visual pseudo-highlight scores for each video using visual features. We then combine the audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlights of each video, which are used to train an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks showcase the superior performance of our method over prior work.
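The pseudo-highlight pipeline described above can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the similarity measure, or how the two modalities are combined, so everything below is an assumption made for illustration: plain k-means over mean-pooled audio-visual features, cosine similarity with best-match aggregation across peer videos, and an equal-weight average of min-max normalized scores. All function names are hypothetical, and audio and visual clips are assumed to be temporally aligned (same number of clips per video).

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def kmeans(points, k, iters=20, seed=0):
    # Plain k-means (assumed stand-in for the paper's clustering step).
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def recurrence_scores(clip_feats, labels):
    # clip_feats: list of (n_clips_i, d) arrays, one per video.
    # A clip scores high if a similar clip recurs in other videos
    # of the same pseudo-category (assumed aggregation: best match per
    # peer video, averaged over peers).
    feats = [l2_normalize(f) for f in clip_feats]
    scores = []
    for i, f in enumerate(feats):
        peers = [feats[j] for j in range(len(feats))
                 if j != i and labels[j] == labels[i]]
        if not peers:
            # No peer videos in this pseudo-category: no recurrence signal.
            scores.append(np.zeros(len(f)))
            continue
        per_peer = [(f @ p.T).max(axis=1) for p in peers]
        scores.append(np.mean(per_peer, axis=0))
    return scores

def minmax(x):
    # Rescale scores to [0, 1] within a video; constant scores map to zeros.
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def pseudo_highlights(audio, visual, k):
    # 1) Pseudo-categories from aggregated (mean-pooled) audio-visual features.
    video_level = np.stack([np.concatenate([a.mean(0), v.mean(0)])
                            for a, v in zip(audio, visual)])
    labels = kmeans(l2_normalize(video_level), k)
    # 2) Per-modality pseudo-highlight scores from cross-video recurrence.
    a_scores = recurrence_scores(audio, labels)
    v_scores = recurrence_scores(visual, labels)
    # 3) Audio-visual pseudo ground truth (assumed equal-weight average).
    av = [0.5 * (minmax(a) + minmax(v)) for a, v in zip(a_scores, v_scores)]
    return labels, av
```

In this sketch, `pseudo_highlights(audio, visual, k)` takes two lists of per-video clip feature arrays and returns one pseudo-category label per video plus one score per clip in [0, 1]. Those per-clip scores would then serve as training targets for the highlight detection network, which is not sketched here.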
