A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

Ioannis Kontostathis; Evlampios Apostolidis; Vasileios Mezaris

TL;DR

This work tackles 360° video summarization by converting panoramic content into short 2D summaries viewable on standard devices. It introduces the 360-VSumm dataset, derived from the VR-EyeTracking collection, with ground-truth summaries, saliency maps, and an interactive annotation tool to support ground-truth creation. It evaluates two 2D video summarization models (PGL-SUM and CA-SUM) as baselines and shows that transfer from 2D to 360° is weak, while retraining on 360-VSumm and using frame saliency yields measurable gains. The dataset and workflow provide a public, reproducible benchmark that should accelerate the development of 360°-specific summarization methods.
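Like most 2D summarization methods, PGL-SUM and CA-SUM predict frame-level importance scores rather than a summary directly, so a separate step turns scores into a short video. A common convention in the 2D literature is to average the scores inside each video fragment and then choose fragments with a 0/1 knapsack under a length budget (often 15% of the video). The sketch below assumes that convention; the fragment boundaries, the mean-score aggregation and the `ratio` parameter are assumptions, not details confirmed here for 360-VSumm.

```python
from typing import List, Sequence, Tuple


def select_fragments(scores: Sequence[float], lengths: Sequence[int], budget: int) -> List[int]:
    """0/1 knapsack: indices of fragments maximizing total score with total length <= budget."""
    n = len(scores)
    # dp[i][c]: best total score using the first i fragments within capacity c (frames)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    # Backtrack: a cell differs from the row above only when fragment i-1 was taken
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


def summarize(frame_scores: Sequence[float],
              boundaries: Sequence[Tuple[int, int]],
              ratio: float = 0.15) -> List[int]:
    """Binary per-frame summary from frame scores and inclusive (start, end) fragments."""
    lengths = [end - start + 1 for start, end in boundaries]
    # Assumption: a fragment's score is the mean of its frame scores
    scores = [sum(frame_scores[s:e + 1]) / (e - s + 1) for s, e in boundaries]
    budget = int(len(frame_scores) * ratio)  # summary length limit in frames
    summary = [0] * len(frame_scores)
    for idx in select_fragments(scores, lengths, budget):
        start, end = boundaries[idx]
        summary[start:end + 1] = [1] * (end - start + 1)
    return summary
```

A knapsack is used instead of greedy top-k selection because fragments differ in length, so the highest-scoring fragments do not necessarily fit the budget together.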

Abstract

In this paper, we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content into concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes human-generated ground-truth summaries that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.
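In 2D video summarization, objective evaluation against human-generated summaries is commonly done by converting each annotator's selected fragments into a binary per-frame vector and computing the F-score between the machine and user summaries, then averaging (or taking the maximum) over users. The following sketch assumes that protocol; the function names, the inclusive frame indices and the `reduce` option are hypothetical and do not describe the exact evaluation used for 360-VSumm.

```python
from typing import List, Sequence, Tuple


def fragments_to_binary(n_frames: int, fragments: Sequence[Tuple[int, int]]) -> List[int]:
    """Mark the frames of each selected (start, end) fragment, inclusive, with 1."""
    summary = [0] * n_frames
    for start, end in fragments:
        for t in range(start, end + 1):
            summary[t] = 1
    return summary


def f_score(machine: Sequence[int], user: Sequence[int]) -> float:
    """Harmonic mean of precision and recall of the frame-level overlap."""
    if len(machine) != len(user):
        raise ValueError("summaries must cover the same number of frames")
    overlap = sum(m & u for m, u in zip(machine, user))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine)
    recall = overlap / sum(user)
    return 2 * precision * recall / (precision + recall)


def evaluate(machine: Sequence[int], users: Sequence[Sequence[int]], reduce: str = "avg") -> float:
    """Compare one machine summary with every human summary and aggregate the F-scores."""
    scores = [f_score(machine, user) for user in users]
    return max(scores) if reduce == "max" else sum(scores) / len(scores)
```

Averaging rewards agreement with all annotators, while taking the maximum only asks the summary to match one of them; which reduction suits a dataset depends on how consistent its annotators are.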

Paper Structure (8 sections, 4 figures, 5 tables)

Introduction
Literature Review
The 360-VSumm Dataset
Experiments
Implementation Details
Quantitative Results
Qualitative Results
Conclusions and Next Steps

Figures (4)

Figure 1: Histograms with the number of events per video (left side) and the number of events running in parallel per video (right side).
Figure 2: The graphical interface of the developed annotation tool.
Figure 3: A frame-based overview of the presented events in the video (top part), and the produced summaries by the best-performing models of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part).
Figure 4: A frame-based overview of the presented events in the video (top part), and the produced summaries by the best-performing models of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part).
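Figure 1 reports two per-video statistics. If each annotated event is treated as a time interval, both can be computed as in the sketch below. The number of events is a simple count. "Events running in parallel" is interpreted here, as an assumption, as the maximum number of events active at the same instant, found with a sweep over interval endpoints. The annotation layout `{video_id: [(start, end), ...]}` and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[float, float]  # (start, end) in seconds, treated as half-open [start, end)


def max_parallel_events(events: List[Event]) -> int:
    """Largest number of events active at the same instant (sweep over endpoints)."""
    points = []
    for start, end in events:
        points.append((start, 1))   # an event opens
        points.append((end, -1))    # an event closes
    # At equal timestamps, -1 sorts before +1, so back-to-back events do not overlap
    points.sort()
    active = best = 0
    for _, delta in points:
        active += delta
        best = max(best, active)
    return best


def event_histograms(annotations: Dict[str, List[Event]]) -> Tuple[Counter, Counter]:
    """Histogram data: videos per event count, and videos per parallel-event count."""
    events_per_video = Counter(len(events) for events in annotations.values())
    parallel_per_video = Counter(max_parallel_events(events) for events in annotations.values())
    return events_per_video, parallel_per_video
```

The sweep runs in O(n log n) per video because of the sort, which is negligible for the handful of events a single video typically contains.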