Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition
Cheng Liu, Xuyang Yan, Zekun Zhang, Cheng Ding, Tianhao Zhao, Shaya Jannati, Cynthia Martinez, Dietrich Stout
TL;DR
The paper introduces HSTAG, a domain-specific video dataset for fine-grained human action recognition in stone toolmaking. It comprises 18,739 video clips across seven action classes and features multi-view capture, tool switching, and rapid action transitions. The paper details data collection from experts, a BORIS-based annotation pipeline with a seven-action grammar, and a 75:25 train/test split that highlights class imbalances. The authors benchmark three state-of-the-art models (VideoMAEv2, TimeSformer, ResNet+GRU), analyze performance and class separability, and discuss challenges such as high-frequency frames and action similarity. The work positions HSTAG as a challenging benchmark for testing how well computer vision techniques generalize to rarely seen domains and for spurring the development of methods robust to unbalanced temporal sequences.
Abstract
Action recognition has seen a growing number of novel algorithms and datasets over the past decade. However, most public benchmarks are built around activities of daily living and annotated at a rather coarse-grained level, leaving a shortage of domain-specific datasets, especially for rarely seen domains. In this paper, we introduce Human Stone Toolmaking Action Grammar (HSTAG), a meticulously annotated video dataset of previously undocumented stone toolmaking behaviors. It can be used to investigate how advanced artificial intelligence techniques handle a rapid succession of complex interactions between two hand-held objects. HSTAG consists of 18,739 video clips that record 4.5 hours of experts' stone toolmaking activities. Its unique features include (i) brief action durations and frequent transitions, mirroring the rapid changes inherent in many motor behaviors; (ii) multiple viewing angles and switches among multiple tools, which increase intra-class variability; and (iii) unbalanced class distributions and high similarity among different action sequences, which make it harder to capture distinct patterns for each action. We conduct an experimental analysis with several mainstream action recognition models, which showcases the challenges and uniqueness of HSTAG (https://nyu.databrary.org/volume/1697).
