HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions

Alexandru Bobe; Jan C. van Gemert

TL;DR

This paper presents a novel annotation pipeline that accelerates temporal video annotation by applying dimensionality reduction to pre-extracted features, using Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of the video features.

Abstract

Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate a significant reduction in annotation effort compared to traditional linear methods, requiring more than 10x fewer clicks to annotate over 12 hours of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding.

Paper Structure (23 sections, 8 figures)

Introduction
Related Work
Video Understanding.
Temporal Action Localization.
Temporal Video Annotations.
Feature Extraction.
Dimensionality Reduction.
The Annotation Pipeline
Feature Extraction
t-distributed Stochastic Neighbor Embedding (t-SNE)
Hierarchical Stochastic Neighbor Embedding (HSNE)
Annotation Platform
Experiments
Datasets & Features Used
Synthetic Features.
...and 8 more sections

Figures (8)

Figure 1: An example of temporal video annotation. The goal of this task is to create text annotations which identify the actions happening at each moment in the input video. These annotations can be further used to train new models or by domain experts for analysis.
Figure 2: An overview of our annotation pipeline. (1) The model-agnostic pre-extracted features are used as input to the analysis. (2) The HSNE (Hierarchical Stochastic Neighbor Embedding) analysis is created. (3) A human-in-the-loop approach is used to dig into the hierarchy and annotate at the deepest scale. (4) Temporal action annotations are output in JSON format.
Figure 3: A visualisation of how the landmarks and scales work in HSNE. The area of influence of the landmarks in Scale S can be seen in Scale 1. The number of intermediate scales varies depending on the user-specified S parameter.
Figure 4: Estimated effort in clicks needed to annotate the test set of THUMOS14 using our pipeline and a traditional linear method. The results show a more than 10 times improvement when using our method.
Figure 5: An estimate of the effort required to annotate the test set of THUMOS14. Perfect features are the synthetic features, created as explained in the Synthetic Features section. The THUMOS14 features correspond to the test set features of the THUMOS14 dataset. The plot shows a possible 50% improvement in effort from increasing the feature quality.
...and 3 more figures
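
Step (4) of Figure 2 states that temporal action annotations are output in JSON format. As a rough illustration only, the sketch below merges per-frame labels into temporal segments and serializes them. The function name `frames_to_segments`, the field names (`label`, `start_sec`, `end_sec`), the frame rate, and the example labels are all assumptions made for this sketch; the paper's actual JSON schema is not given here.

```python
import json
from itertools import groupby


def frames_to_segments(frame_labels, fps):
    """Merge runs of identical consecutive frame labels into temporal segments.

    Hypothetical helper: `None` marks frames without an action label.
    """
    segments = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label is not None:
            segments.append({
                "label": label,
                "start_sec": start / fps,
                "end_sec": (start + length) / fps,
            })
        start += length
    return segments


# Illustrative per-frame labels (assumed data, not taken from any dataset).
frame_labels = [None, None, "jump", "jump", "jump", None, "throw", "throw"]
segments = frames_to_segments(frame_labels, fps=2.0)

# Assumed output layout: one JSON object per video with a list of segments.
annotation_json = json.dumps(
    {"video": "example_video", "segments": segments}, indent=2
)
print(annotation_json)
```

Storing segments rather than per-frame labels keeps the file small and matches the start/end form commonly used for temporal action localization, but any layout carrying the same information would work.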
