Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

Uttaran Bhattacharya, Gang Wu, Stefano Petrangeli, Viswanathan Swaminathan, Dinesh Manocha

TL;DR

A method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched, with an absolute improvement of 2-4% in the mean average precision of the detected highlights.

Abstract

We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
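
As a rough illustration of the metric quoted above, the following sketch computes average precision for one video by ranking its frames by predicted highlight score, then averages over videos to obtain the mean average precision. An "absolute" improvement of 2-4% refers to a difference in percentage points between two such values, not a ratio. The function names and the frame-level ranking are assumptions made for illustration; the evaluation protocol actually used by the authors is not specified in this section.

```python
def average_precision(scores, labels):
    """Average precision for one video (hypothetical helper).

    scores: predicted highlight score per frame
    labels: 1 if the frame is a ground-truth highlight, else 0
    """
    # Rank frame indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            # Precision among the top `rank` frames, recorded at each positive.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(videos):
    """Mean of per-video average precision; `videos` is a list of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in videos]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    toy_videos = [
        ([0.9, 0.8, 0.1, 0.4], [1, 0, 0, 1]),
        ([0.2, 0.7], [0, 1]),
    ]
    print(mean_average_precision(toy_videos))
```

The toy scores and labels are made up solely to exercise the functions and do not correspond to any result in the paper.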

Paper Structure (24 sections, 9 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: User-Specific Highlights for a Variety of Users and Target Videos. For each user, we consider a set of highlight clips denoting their individual preferences (left) and detect highlights for them (right) on different target videos (center top and bottom). Given the users' overall highlight preferences, our method employs a multi-head attention (MHA) mechanism to learn which segments of the target videos are relevant highlights based on feature similarities between them (center block). For example, our method learns that user A prefers watching cooking and workout videos. Therefore, given a target video containing cooking and eating, our method identifies that only the cooking segments are relevant between the preferred clips and the target videos, and detects those as highlights. Similarly, our method learns that user B prefers skating and surfing videos. Therefore, given a video containing parkour and skating activities, our method detects only the skating segments as highlights. Overall, our method significantly advances the state-of-the-art in user-specific highlight detection given a diverse, large-scale dataset of user preferences and target videos.
  • Figure 2: Our User-Specific Highlight Detection Network. For each preferred clip $i$, we use the two priming blocks to map the object-based features and the pose-based features to the respective features $z_i$ and $y_i$. We use these features to learn the per-frame weights $w_i$ and $v_i$ using multi-head attention (MHA), perform per-frame attention pooling, learn the per-clip weights $\sigma_i$ and $\rho_i$ using MHA again, and fuse the per-clip features using weighted summation to get the fused features $\psi_i$ and $\phi_i$. For each target video $\tau$, we train a separate set of attention priming and MHA layers to obtain fused features $\psi_\tau$ and $\phi_\tau$. We compute the similarities between the fused features of the preferred clips and the target video using scaled matrix products, concatenate the resulting features, and map them to per-frame highlight scores for the target video using a fully-connected prediction block.
  • Figure 3: Priming and MHA for Objects. Priming on the YOLOv5 features using 3D convolutions (blue blocks) and 3D batch norms (green blocks). We use the attention-primed features to learn the per-frame attention weights $w_i$ and $w_\tau$ using fully-connected layers (orange blocks).
  • Figure 4: Priming and MHA for Poses. Priming on the Detectron2 features using spatial temporal graph convolutions with feature pooling (green arrow) on the five kinematic chains: trunk, two arms and two legs. We use the attention-primed features to learn the per-frame attention weights $v_i$ and $v_\tau$ using fully-connected layers (orange blocks).
  • Figure 5: Qualitative Results. We show qualitative results of our method and the current best baseline of HighlightMe for four users in the testing set of PHD$^2$. For each user, we show sample frames of (i) their preferred clips, (ii) ground-truth highlights selected by them from their target videos, (iii) highlights detected by HighlightMe, and (iv) highlights detected by our method. For each user, we observe that our method matches the ground-truth more closely than HighlightMe.
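
The pooling and scoring steps named in the Figure 2 caption (per-frame attention pooling, per-clip weighted summation, and similarity via scaled products) can be sketched in a heavily simplified form as follows. This is a minimal single-head illustration under stated assumptions, not the authors' network: the fixed vectors `frame_query` and `clip_query` stand in for the learned attention layers, only one feature stream is shown instead of separate object and pose streams, and the priming blocks and the fully-connected prediction block are omitted. All function and parameter names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(features, query):
    """Softmax-weighted sum of feature vectors.

    Weights come from scaled dot products with `query`, an assumed
    stand-in for learned attention parameters.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(f, query) / scale for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def user_representation(clips, frame_query, clip_query):
    """Pool frames within each clip, then fuse the clips by weighted summation."""
    clip_feats = [attention_pool(frames, frame_query)[0] for frames in clips]
    return attention_pool(clip_feats, clip_query)


def frame_scores(target_frames, user_feat):
    """Scaled dot-product similarity between the user feature and each target frame."""
    scale = math.sqrt(len(user_feat))
    return [dot(f, user_feat) / scale for f in target_frames]


if __name__ == "__main__":
    # Two toy preferred clips, two frames each, 2-D features (made-up values).
    clips = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
    fused, clip_weights = user_representation(clips, [0.0, 0.0], [2.0, 0.0])
    print(clip_weights)
    print(frame_scores([[1.0, 0.0], [0.0, 1.0]], fused))
```

In this toy setup the clip query favors the first clip, so target frames resembling that clip receive higher similarity scores. In the described network these weights are learned with MHA, and the similarities from the object and pose streams are concatenated before the final prediction block.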