HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

Hang Zhao, Antonio Torralba, Lorenzo Torresani, Zhicheng Yan

TL;DR

The paper introduces HACS, a large-scale video dataset with two annotation streams: HACS Clips for action recognition (1.5M labeled 2-second clips from 504K videos) and HACS Segments for dense temporal localization (139K segments in 50K videos across 200 actions). It develops a classifier-consensus/disagreement–driven clip mining pipeline, followed by rigorous human annotation with quality control, and provides a comprehensive evaluation of recognition and localization methods. Key contributions include a thorough study of clip-sampling strategies, state-of-the-art pretraining benefits on downstream benchmarks, and challenging dense localization benchmarks with BSN/TAG and SSN methods. The dataset enables robust transfer learning for recognition and establishes a new, more demanding standard for action localization, potentially accelerating advances in spatiotemporal modeling.

Abstract

This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time and Sports1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.
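The idea of mining clips from classifier consensus and disagreement can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0.5 threshold, and the three bucket labels are all assumptions made for illustration. The abstract states only that agreement and disagreement among visual classifiers guide which clips are sent to human annotators.

```python
from typing import Dict, List, Sequence


def categorize_clip(scores: Sequence[float], threshold: float = 0.5) -> str:
    """Bucket one clip by how several classifiers vote on a target action.

    `scores` holds one probability per classifier. The threshold and the
    bucket names are hypothetical, chosen only for this illustration.
    """
    if not scores:
        raise ValueError("at least one classifier score is required")
    votes = [s >= threshold for s in scores]
    if all(votes):
        return "consensus_positive"   # every classifier says the action is present
    if not any(votes):
        return "consensus_negative"   # every classifier says it is absent
    return "disagreement"             # classifiers split: a likely hard example


def mine_candidates(
    clip_scores: Dict[str, Sequence[float]], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Group clip ids into the three buckets defined above."""
    buckets: Dict[str, List[str]] = {
        "consensus_positive": [],
        "consensus_negative": [],
        "disagreement": [],
    }
    for clip_id, scores in clip_scores.items():
        buckets[categorize_clip(scores, threshold)].append(clip_id)
    return buckets


if __name__ == "__main__":
    demo = {
        "clip_a": [0.9, 0.8],  # both classifiers agree: positive
        "clip_b": [0.1, 0.2],  # both classifiers agree: negative
        "clip_c": [0.9, 0.1],  # classifiers disagree
    }
    print(mine_candidates(demo))
```

In a sketch like this, clips from each bucket would then be sampled and passed to human annotators for validation; how many clips to draw from each bucket is a design choice that the abstract does not specify.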

Paper Structure

This paper contains 18 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparisons of manually labeled action recognition datasets (Top) and action localization datasets (Bottom), where ours are marked as red. The marker size encodes the number of action classes in logarithmic scale.
  • Figure 2: Our pipeline of sparse clip sampling and labeling.
  • Figure 3: Evaluating Res3D-18 and I3D models trained on 3 different Train-mini sets on 4 different validation sets.
  • Figure 4: Examples of dense segment annotations. Action definition is clarified in the guideline to reduce the ambiguity of action boundaries.
  • Figure 5: Action segment annotation tool. A timeline overview is shown below the video player, and a zoom-in view of current time window is shown in the bottom for accurate temporal annotation.
  • ...and 2 more figures