SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition

Zhuoxuan Peng, Yiyi Ding, Yang Lin, S. -H. Gary Chan

Abstract

Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
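The abstract describes SBF as three per-frame maps (scale, body, flow) used alongside the skeleton. As a rough illustration of the idea, the sketch below stacks these maps channel-wise with per-joint skeleton heatmaps to form one recognition input. All shapes, channel counts, and names here are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of assembling a per-frame input from skeleton
# joint heatmaps plus the three SBF components named in the abstract:
# per-joint scale maps, a body contour map, and an optical-flow map.
# Channel counts and resolution are assumptions for illustration only.

NUM_JOINTS = 17   # e.g., a COCO-style 2D skeleton (assumption)
H, W = 56, 56     # spatial resolution of all maps (assumption)

def zeros(h, w):
    """A h-by-w map of zeros, standing in for a real predicted map."""
    return [[0.0] * w for _ in range(h)]

def assemble_frame_input(joint_heatmaps, scale_maps, body_map, flow_map):
    """Stack all maps channel-wise: [heatmaps | scale | body | flow]."""
    channels = []
    channels.extend(joint_heatmaps)   # NUM_JOINTS skeleton channels
    channels.extend(scale_maps)       # per-joint scale/depth channels
    channels.append(body_map)         # 1 channel: body contour
    channels.extend(flow_map)         # 2 channels: flow (dx, dy)
    return channels

frame = assemble_frame_input(
    joint_heatmaps=[zeros(H, W) for _ in range(NUM_JOINTS)],
    scale_maps=[zeros(H, W) for _ in range(NUM_JOINTS)],
    body_map=zeros(H, W),
    flow_map=[zeros(H, W), zeros(H, W)],
)
print(len(frame))  # 17 + 17 + 1 + 2 = 37 channels
```

In practice such a stacked volume would be fed per frame (or per clip) to the action-recognition backbone; the point of the sketch is only that SBF augments, rather than replaces, the skeleton channels.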

Paper Structure

This paper contains 32 sections, 4 equations, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: Comparison of video-based HAR pipelines. Our proposed pipeline (c) employs SBF predicted by SFSNet to augment skeleton for effective HAR, addressing the limitations of the skeleton-only approach (b).
  • Figure 2: Failure cases of HAR based on extracting 2D skeletons from videos, addressed with our SBF representation predicted via SFSNet. Top row: original video frame; middle row: skeleton with prediction error; bottom row: correct prediction using skeleton+SBF, owing to the captured action-related information. Red circles highlight key regions possibly leading to the error. (a) Joint depth: The 2D skeleton's flat nature makes the depth of each joint ambiguous, e.g., confusing "squatting down" with "sitting down" from a front view. (b) Body contour: The contour of the human body, an encircling of all body parts, provides richer features than the skeleton. E.g., "reaching into a pocket" is mis-predicted as "drop" because the subject's left hand does not overlap with his body in the skeleton. (c) Human-object interaction (HOI): The skeleton misses interactions between the subject and objects, e.g., "throwing" is misinterpreted as "waving" because the motion between the subject and the thrown object is ignored.
  • Figure 3: An example video frame, its extracted skeleton, and our proposed SBF.
  • Figure 4: The overall structure of SFSNet. The flow estimator is pretrained via unsupervised learning.
  • Figure 5: A conceptual example of the "waving" action for our annotation generation method in SPR. In each segmentation task, the white region denotes $\mathcal{P}^{\mathit{pos}}$ (positive labels), the black region represents $\mathcal{P}^{\mathit{neg}}$ (negative labels), and the grey region indicates areas excluded from point sampling.
  • ...and 4 more figures