HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
Hang Zhao, Antonio Torralba, Lorenzo Torresani, Zhicheng Yan
TL;DR
The paper introduces HACS, a large-scale video dataset with two sets of annotations: HACS Clips for action recognition (1.5M labeled 2-second clips from 504K videos) and HACS Segments for dense temporal localization (139K segments in 50K videos across 200 action categories). Candidate clips are mined using consensus and disagreement among visual classifiers, then verified by human annotators with quality control. The paper studies clip-sampling strategies, shows that pretraining on HACS Clips improves results on downstream benchmarks, and evaluates BSN, TAG, and SSN on the dense localization benchmark. The dataset supports transfer learning for recognition and sets a more demanding benchmark for action localization, which may help advance spatiotemporal modeling.
Abstract
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time and Sports1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.
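
The abstract does not spell out how consensus and disagreement among classifiers are turned into candidate clips, so the following is only a minimal sketch of one plausible reading, not the authors' pipeline. It assumes each clip has a probability for one action class from several classifiers. The function name `mine_candidate_clips`, the thresholds `pos_thresh` and `neg_thresh`, and the example scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def mine_candidate_clips(
    scores: Dict[str, List[float]],
    pos_thresh: float = 0.8,  # assumed cutoff for a confident "yes"
    neg_thresh: float = 0.2,  # assumed cutoff for a confident "no"
) -> Tuple[List[str], List[str]]:
    """Split clips into consensus and disagreement candidates.

    `scores` maps a clip id to the probabilities that several
    (hypothetical) classifiers assign to one action class.
    """
    consensus: List[str] = []
    disagreement: List[str] = []
    for clip_id, probs in scores.items():
        accepts = [p >= pos_thresh for p in probs]
        rejects = [p <= neg_thresh for p in probs]
        if all(accepts):
            # Every classifier is confident: a likely positive clip.
            consensus.append(clip_id)
        elif any(accepts) and any(rejects):
            # Classifiers contradict each other: a likely hard example.
            disagreement.append(clip_id)
    return consensus, disagreement


# Made-up scores from three classifiers, for illustration only.
example = {
    "clip_a": [0.95, 0.90, 0.88],  # all classifiers agree
    "clip_b": [0.92, 0.10, 0.85],  # classifiers disagree
    "clip_c": [0.05, 0.10, 0.02],  # confidently negative, not selected
}
consensus, disagreement = mine_candidate_clips(example)
```

In this sketch both lists would be sent to human annotators for validation, as the abstract describes. Sending the disagreement clips as well is a design choice that keeps harder, less obvious examples in the pool rather than only the easy positives.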
