A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
TL;DR
This work tackles $360^{\circ}$ video summarization by converting panoramic content into short 2D summaries viewable on standard devices. It introduces the 360-VSumm dataset, derived from the VR-EyeTracking collection, with ground-truth summaries, saliency maps, and an interactive annotation tool to support ground-truth creation. It evaluates two 2D video summarization models (PGL-SUM and CA-SUM) as baselines and shows that transfer from 2D to $360^{\circ}$ is weak, while retraining on 360-VSumm and using frame saliency yields measurable gains. The dataset and workflow provide a public, reproducible benchmark that should accelerate development of $360^{\circ}$-specific summarization methods.
Abstract
In this paper, we introduce a new dataset for 360-degree video summarization, i.e., the transformation of 360-degree video content into concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art methods that were originally proposed for 2D-video summarization, to serve as baselines for future comparisons with summarization methods specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.
