Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning

Tingtian Li, Zixun Sun, Xinyu Xiao

TL;DR

This work tackles unsupervised video highlight detection when audio is unavailable at inference time. A cross-modal model is pretrained on image-audio pairs, and a dedicated Representation Activation Sequence Learning (RASL) module identifies salient moments via top-$k$ activations, complemented by a Symmetric Contrastive Learning (SCL) branch that links visual and audio representations. An auxiliary masked Feature Vector Sequence (FVS) reconstruction task, trained jointly with the main task, strengthens the latent representations. The framework runs inference on visual input alone while retaining the cross-modal semantics learned during pretraining, and achieves superior or competitive results on YouTube Highlights and TVSum against supervised, weakly supervised, and unsupervised baselines. For in-the-wild video editing, the approach reduces labeling demands, supports highlight detection on unseen domains, and stays efficient with a compact model.

Abstract

Identifying highlight moments of raw video materials is crucial for improving the efficiency of editing videos that are pervasive on internet platforms. However, the extensive work of manually labeling footage has created obstacles to applying supervised methods to videos of unseen categories. The absence of an audio modality that contains valuable cues for highlight detection in many videos also makes it difficult to use multimodal strategies. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module with k-point contrastive learning to learn significant representation activations. To connect the visual modality with the audio modality, we use the symmetric contrastive learning (SCL) module to learn the paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is simultaneously conducted during pretraining for representation enhancement. During inference, the cross-modal pretrained model can generate representations with paired visual-audio semantics given only the visual modality. The RASL module is used to output the highlight scores. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
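
As a rough illustration of how paired visual and audio representations can be pulled together while unpaired ones are pushed apart, the sketch below computes a symmetric, InfoNCE-style contrastive loss in plain Python. This is a generic stand-in, not the SCL loss defined in the paper: the function name `symmetric_contrastive_loss`, the `temperature` parameter, and the cross-entropy form are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def symmetric_contrastive_loss(visual, audio, temperature=0.1):
    """Average of visual-to-audio and audio-to-visual cross-entropy terms.

    visual[i] and audio[i] are assumed to be a matched pair; every other
    combination is treated as an unpaired negative. (Hypothetical helper,
    not taken from the paper.)
    """
    # sim[i][j]: scaled product of visual vector i and audio vector j
    sim = [[dot(v, a) / temperature for a in audio] for v in visual]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]  # the matched pair sits on the diagonal
        return total / len(rows)

    v2a = cross_entropy(sim)                               # visual -> audio
    a2v = cross_entropy([list(col) for col in zip(*sim)])  # audio -> visual
    return 0.5 * (v2a + a2v)


# Toy check: matched pairs give a much lower loss than swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

The loss is symmetric because each modality takes a turn as the query, so neither the visual nor the audio branch is favored.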

Paper Structure (19 sections, 11 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: We evaluate the similarity of feature vectors extracted by the feature extractor [liu2021swin] from videos in the YouTube Highlights and TVSum datasets and show the number of feature vectors that exhibit similarity to others within the videos. We calculate the mean squared error (MSE) to assess the similarity between the target feature vector and the others within a 30-second window. The sampling rate is set to 5 frames/second. If the MSE is below the threshold, we consider the two vectors to be similar. The results show that highlights have fewer similar feature vectors.
  • Figure 2: Flowcharts of the weakly supervised visual-audio-based methods [ye2021temporal, hong2020mini] and the proposed unsupervised method. The methods of [ye2021temporal, hong2020mini] need both the visual and the audio modalities as input during training and inference. The proposed method requires the visual and audio modalities as input only during pretraining. During inference, only the visual modality is needed to obtain representations with visual-audio-level semantics from the pretrained model, and the RASL module is used to output the highlight scores.
  • Figure 3: The framework of the proposed highlight detection method. During pretraining, we input the visual and audio clips into the two branches and extract their feature vectors to compose the visual and audio FVSs. Then, we enhance the FVSs of the two modalities via the SA modules. After that, the self-attended FVSs are fed to the autoencoders for self-reconstruction. The significant activations are learned from the RASL module. The paired visual-audio representations are learned through the SCL module. The auxiliary task of masked FVS reconstruction is conducted to improve the performance of the main highlight detection task. During inference, we use only the cross-modal pretrained visual branch and the RASL module to output highlight scores.
  • Figure 4: Illustration of the proposed RASL module, which is shown in the dotted box. The representation vectors $\bm{r}^{m}$ output from the encoder are sent to the module, which learns $\bm{s}^{m}$ via $k$-point contrastive learning. Then, the product of $\bm{s}^{m}$ and $\bm{z}^{m}$ is fed to the decoder for FVS reconstruction.
  • Figure 5: Illustration of the SCL module. The module maximizes the product of paired representation vectors and minimizes the product of unpaired representation vectors.
  • ...and 3 more figures
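
The similarity count described in the Figure 1 caption can be sketched as follows. This is a minimal pure-Python illustration, assuming a window centred on the target vector and an arbitrary `threshold` of 0.1; the caption does not state the threshold value or how the window is aligned, and the names `mse` and `count_similar` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def count_similar(features, fps=5, window_sec=30, threshold=0.1):
    """For each feature vector, count the other vectors inside the window
    whose MSE to it falls below `threshold`."""
    # Assumption: the 30-second window is centred on the target vector.
    half = (fps * window_sec) // 2
    counts = []
    for i, target in enumerate(features):
        lo, hi = max(0, i - half), min(len(features), i + half + 1)
        counts.append(sum(
            1 for j in range(lo, hi)
            if j != i and mse(target, features[j]) < threshold
        ))
    return counts


# Toy sequence: four near-identical "background" vectors and one distinct one.
features = [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [3.0, 3.0], [0.02, 0.02]]
counts = count_similar(features)
print(counts)  # [3, 3, 3, 0, 3]
```

In this toy input the distinct vector at index 3 has no similar neighbours, mirroring the caption's observation that highlights have fewer similar feature vectors.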