HighlightMe: Detecting Highlights from Human-Centric Videos

Uttaran Bhattacharya, Gang Wu, Stefano Petrangeli, Viswanathan Swaminathan, Dinesh Manocha

TL;DR

A domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. It improves the mean average precision of matching the human-annotated highlights by 4–12% over state-of-the-art methods on four benchmark datasets (DSH, TVSum, PHD2, and SumMe), without requiring any user-provided preferences or dataset-specific fine-tuning.

Abstract

We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. Our method works on the graph-based representation of multiple observable human-centric modalities in the videos, such as poses and faces. We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions based on these modalities. We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames. We use these scores to compute which frames to highlight and stitch contiguous frames to produce the excerpts. We train our network on the large-scale AVA-Kinetics action dataset and evaluate it on four benchmark video highlight datasets: DSH, TVSum, PHD2, and SumMe. We observe a 4–12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods in these datasets, without requiring any user-provided preferences or dataset-specific fine-tuning.
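
The last step of the pipeline, turning per-frame highlight scores into excerpts, can be illustrated with a minimal sketch. It assumes the scores are already available as a list of floats and that a frame is highlighted when its score meets a threshold; the function name `stitch_highlights` and the default threshold of 0.5 are hypothetical choices for illustration, not values taken from the paper.

```python
from typing import List, Tuple


def stitch_highlights(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group frames whose score >= threshold into contiguous excerpts.

    Returns inclusive (start_frame, end_frame) index pairs.
    The threshold value is an assumption made for this sketch.
    """
    excerpts: List[Tuple[int, int]] = []
    start = None  # index of the first frame in the current excerpt, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # open a new excerpt
        elif start is not None:
            excerpts.append((start, i - 1))  # close the excerpt at the previous frame
            start = None
    if start is not None:
        excerpts.append((start, len(scores) - 1))  # excerpt runs to the last frame
    return excerpts


if __name__ == "__main__":
    frame_scores = [0.1, 0.7, 0.8, 0.2, 0.9, 0.95, 0.6, 0.3]
    print(stitch_highlights(frame_scores))  # [(1, 2), (4, 6)]
```

Raising the threshold yields fewer and shorter excerpts, which is the trade-off a highlight score threshold controls.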

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Detecting highlight excerpts using human-centric modalities. Our method leverages multiple human-centric modalities, e.g., body poses and faces, observable in videos focusing on human activities, to detect highlights. We use a 2D or 3D interconnected point representation of each modality to construct a spatial-temporal graph representation to compute the highlight scores.
  • Figure 2: Representativeness. We show frames with different values of representativeness calculated in the space of poses (left) and face landmarks (right). We learn highlight scores based on the representativeness.
  • Figure 3: Highlight detection with human-centric modalities: Overview of our network for learning highlight scores from multiple human-centric modalities. We use standard techniques to detect the human-centric modalities. We represent the modalities as sets of connected points in either 2D or 3D. We train the networks for all the modalities in parallel. The only point of interaction between the networks is their predicted highlight scores, which we combine into our weighted highlight score for training.
  • Figure 4: Average precision by highlight score threshold $h_{\textrm{thres}}$ on the domains in the DSH dataset.
  • Figure 5: Sample highlight frames detected by our method. We show sample frames across the range of highlight scores as detected by different ablated versions of our method. We show one sample video from the datasets SumMe, PHD$^2$, DSH, and TVSum, in order from top to bottom. When using only faces or only poses, our method learns highlight scores based only on face- or pose-based representativeness. Combining both the modalities, our method learns highlight scores based on representativeness from both.
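
The results above are reported as (mean) average precision against human-annotated highlights. As a rough illustration of that metric, the sketch below computes the standard ranking-based average precision for one video, assuming per-frame predicted scores and binary per-frame annotations. This is the textbook definition, not the paper's evaluation code; the function name `average_precision` and the toy inputs are hypothetical.

```python
from typing import List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Ranking-based average precision for one video.

    scores: predicted per-frame highlight scores.
    labels: 1 if the frame is annotated as a highlight, else 0.
    Frames are ranked by descending score; precision is averaged
    over the ranks at which annotated frames appear.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Toy example: annotated frames are ranked 1st and 3rd by score.
    ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
    print(round(ap, 4))  # 0.8333
```

Averaging this quantity over the videos in a dataset gives a mean average precision.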