HabitAction: A Video Dataset for Human Habitual Behavior Recognition

Hongwu Li, Zhenliang Zhang, Wei Wang

TL;DR

This work defines the problem of recognizing human habitual behaviors (HHBs) and introduces HabitAction, a large-scale HHB video dataset with 30 categories, more than 300,000 frames, and 6,899 action instances, annotated frame-by-frame with per-person skeletons and bounding boxes via DC-pose. To address the subtle, localized nature of HHBs, the authors propose a two-stream recognition model that fuses skeleton-based features (AGC-LSTM with self-attention) and RGB appearance (SlowFast), augmented by an action mask that emphasizes relevant local regions. Experiments show that the two-stream model with the learned mask significantly outperforms single-channel baselines and large pre-trained video models on the HHB dataset, achieving around 84% accuracy, with the mask contributing roughly a 4% gain. Together, the dataset and approach support finer-grained video understanding of personality, habits, and psychological states, with implications for HCI, affective computing, and behavior analysis.

Abstract

Human Action Recognition (HAR) is a crucial task in computer vision that supports a range of downstream tasks, such as understanding human behaviors. Because human behaviors are complex, many highly valuable behaviors are not yet covered by the available HAR datasets, e.g., human habitual behaviors (HHBs). HHBs are important for analyzing a person's personality, habits, and psychological changes. To address this gap, we build a novel video dataset that captures various HHBs. The behaviors in the proposed dataset can reflect the internal mental states and specific emotions of the characters, e.g., crossing one's arms suggests shielding oneself from perceived threats. The dataset contains 30 categories of habitual behaviors, comprising more than 300,000 frames and 6,899 action instances. Since these behaviors usually occupy small local regions of human action videos, existing action recognition methods have difficulty handling these local features. Therefore, we also propose a two-stream model that uses both human skeletons and RGB appearance. Experimental results demonstrate that the proposed method performs much better at action recognition than existing methods on the proposed dataset.

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: The 30 categories of human habitual behaviors collected in the proposed dataset. These behaviors generally occur unconsciously and reflect internal mental states and emotions. Understanding and recognizing habitual behaviors has seldom been explored in previous work.
  • Figure 2: The construction process and data structure of the proposed dataset. First, we use the DC-pose model liu2019learning to obtain the bounding box (bbox) and skeleton information for each person in the collected videos. Then we assign action categories to each individual's motions. The final dataset contains the position of each individual, the corresponding skeleton information, and the associated action categories.
  • Figure 3: An annotation example of one video clip in the dataset. (a) The original video. (b) The video with human bounding boxes. (c) The video with skeleton information. The two persons are annotated and saved as "P1_rubhands" and "P2_crosslegs&touchear".
  • Figure 4: The statistical characteristics of the proposed dataset, which consists of various action categories along with their sample counts and potentially associated emotion categories. (a) Statistical characteristics of the proposed dataset. (b) The emotional attributes reflected by the HHBs.
  • Figure 5: Comparison of skeleton extraction performance across different models. (a) Results obtained with MMAction2 2020mmaction2 from MMLab and with the DC-pose model liu2021deep. The former produces many erroneous predictions, while the latter is more robust. (b) Detection results using the MediaPipe tool from CVZone and the DC-pose model liu2021deep. (c) Skeleton extraction results using the DC-pose model liu2021deep in a multi-person scene. In this work, we choose DC-pose to extract the main joints because of its robustness to complex backgrounds.
  • ...and 3 more figures
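
The Figure 3 caption gives two per-person annotation labels, "P1_rubhands" and "P2_crosslegs&touchear". As a rough illustration of how such labels might be read, the Python sketch below splits a label into a person index and a list of action names. The pattern `P<index>_<action>[&<action>...]` is an assumption inferred from those two examples only; the names `PersonLabel` and `parse_label` are hypothetical and are not part of the dataset's tooling.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Assumed label pattern, inferred only from the two examples in the
# Figure 3 caption: "P<index>_<action>" with "&" joining several actions.
_LABEL_RE = re.compile(r"^P(\d+)_(.+)$")


@dataclass(frozen=True)
class PersonLabel:
    """Hypothetical container for one person's annotation label."""
    person_id: int
    actions: Tuple[str, ...]


def parse_label(label: str) -> PersonLabel:
    """Split a label such as 'P2_crosslegs&touchear' into its parts."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    # Drop empty pieces so that a stray '&' does not yield blank actions.
    actions = tuple(part for part in match.group(2).split("&") if part)
    if not actions:
        raise ValueError(f"label has no action names: {label!r}")
    return PersonLabel(person_id=int(match.group(1)), actions=actions)


if __name__ == "__main__":
    for raw in ("P1_rubhands", "P2_crosslegs&touchear"):
        print(parse_label(raw))
```

Keeping the actions as a tuple reflects that a single person can carry more than one habitual behavior in the same clip, as the second example in the caption suggests.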