HabitAction: A Video Dataset for Human Habitual Behavior Recognition
Hongwu Li, Zhenliang Zhang, Wei Wang
TL;DR
This work defines the problem of recognizing human habitual behaviors (HHBs) and introduces HabitAction, a large-scale HHB video dataset with 30 categories, more than 300,000 frames, and 6,899 action instances, annotated frame by frame with per-person skeletons and bounding boxes via DC-pose. Because HHBs are subtle and localized, the authors propose a two-stream recognition model that fuses skeleton-based features (AGC-LSTM with self-attention) with RGB appearance features (SlowFast), augmented by an action mask that emphasizes the relevant local regions. In experiments on the HHB dataset, the two-stream model with the learned mask outperforms single-channel baselines and large pre-trained video models, reaching around 84% accuracy, with the mask contributing roughly a 4% gain. Together, the dataset and approach support more nuanced video understanding of personality, habits, and psychological states, with implications for HCI, affective computing, and behavior analysis.
Abstract
Human Action Recognition (HAR) is a crucial task in computer vision that supports downstream tasks such as understanding human behavior. Because human behavior is complex, many highly valuable behaviors, such as human habitual behaviors (HHBs), are not yet covered by existing HAR datasets. HHBs are important for analyzing a person's personality, habits, and psychological changes. To address this gap, we build a novel video dataset of various HHBs. The behaviors in the dataset reflect the internal mental states and specific emotions of the characters; for example, crossing one's arms suggests an attempt to shield oneself from perceived threats. The dataset contains 30 categories of habitual behaviors, with more than 300,000 frames and 6,899 action instances. Because these behaviors usually occupy only small local regions of a human action video, existing action recognition methods struggle to capture such local features. We therefore also propose a two-stream model that uses both human skeletons and RGB appearance. Experimental results show that the proposed method substantially outperforms existing methods on the proposed dataset.
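The two-stream idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the skeleton and RGB streams are combined, so the weighted late fusion in `fuse_scores` is an assumption, and `joint_mask` is a hypothetical, hand-built stand-in for a mask that highlights local regions around selected joints. All function names, the mask radius, and the fusion weight are illustrative.

```python
import numpy as np

NUM_CLASSES = 30  # the dataset defines 30 habitual-behavior categories


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def joint_mask(height, width, joints_xy, radius):
    """Binary mask that is 1 within `radius` pixels of any given joint.

    Hypothetical stand-in: built directly from (x, y) joint coordinates,
    not learned.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for x, y in joints_xy:
        mask[(xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2] = 1.0
    return mask


def fuse_scores(skeleton_logits, rgb_logits, skeleton_weight=0.5):
    """Assumed late fusion: weighted average of per-stream probabilities."""
    p_skeleton = softmax(skeleton_logits)
    p_rgb = softmax(rgb_logits)
    return skeleton_weight * p_skeleton + (1.0 - skeleton_weight) * p_rgb


# Usage with random placeholders in place of real stream outputs.
rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3)).astype(np.float32)
mask = joint_mask(64, 64, joints_xy=[(20, 30), (40, 30)], radius=5)
masked_frame = frame * mask[..., None]  # keep only pixels near the joints

fused = fuse_scores(rng.normal(size=NUM_CLASSES), rng.normal(size=NUM_CLASSES))
predicted_class = int(fused.argmax())
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale; in practice the fusion weight would be tuned on a validation split.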
