A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

This work tackles 360-degree video summarization by converting panoramic content into short 2D summaries viewable on standard devices. It introduces the 360-VSumm dataset, derived from the VR-EyeTracking collection, with ground-truth summaries, saliency maps, and an interactive annotation tool to support ground-truth creation. It evaluates two 2D video summarization models (PGL-SUM and CA-SUM) as baselines and shows that transfer from 2D to 360-degree video is weak, while retraining on 360-VSumm and using frame saliency yields measurable gains. The dataset and workflow provide a public, reproducible benchmark that should accelerate development of 360-degree-specific summarization methods.

Abstract

In this paper, we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content into concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as baselines for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.

Paper Structure (8 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Histograms with the number of events per video (left side) and the number of events running in parallel per video (right side).
  • Figure 2: The graphical interface of the developed annotation tool.
  • Figure 3: A frame-based overview of the presented events in the video (top part), and the produced summaries by the best-performing models of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part).
  • Figure 4: A frame-based overview of the presented events in the video (top part), and the produced summaries by the best-performing models of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part).