Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Yuheng Yang, Haipeng Chen, Zhenguang Liu, Yingda Lyu, Beibei Zhang, Shuang Wu, Zhibo Wang, Kui Ren

TL;DR

The paper addresses skeleton-based action recognition by introducing higher-order motion features, specifically joint and bone angular accelerations, to complement traditional joint coordinates. It proposes Stream-GCN, a multi-stream graph convolutional network that uses cross-channel attention to fuse the different representations and emphasize task-relevant channels, together with a mutual information objective that supervises feature extraction. Empirically, the approach achieves state-of-the-art results on NTU RGB+D, NTU RGB+D 120, and NW-UCLA, and ablations confirm the benefits of the acceleration streams, the attention module, and the MI supervision. The work combines rigid-body kinematics with information-theoretic feature supervision, yielding practical gains for pose-based recognition and insight into the role of higher-order motion cues.

Abstract

Action recognition has long been a fundamental and intriguing problem in artificial intelligence. The task is challenging due to the high-dimensional nature of an action, as well as the subtle motion details to be considered. Current state-of-the-art approaches typically learn from articulated motion sequences in the straightforward 3D Euclidean space. However, the vanilla Euclidean space is not efficient for modeling important motion characteristics such as the joint-wise angular acceleration, which reveals the driving force behind the motion. Moreover, current methods typically attend to each channel equally and lack theoretical constraints on extracting task-relevant features from the input. In this paper, we seek to tackle these challenges from three aspects: (1) We propose to incorporate an acceleration representation, explicitly modeling the higher-order variations in motion. (2) We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention, where different representations (i.e., streams) supplement each other towards more precise action recognition while attention capitalizes on the important channels. (3) We explore feature-level supervision for maximizing the extraction of task-relevant information and formulate this as a mutual information loss. Empirically, our approach sets new state-of-the-art performance on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Our code is anonymously released at https://github.com/ActionR-Group/Stream-GCN, hoping to inspire the community.
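
To make the idea of "higher-order variations in motion" concrete, the following is a minimal sketch of how velocity, acceleration, and an angular counterpart could be derived from a skeleton sequence with finite differences. It is not the authors' implementation: the function names `motion_orders` and `bone_angle_orders`, the `(T, V, 3)` array layout, the `bones` list of `(child, parent)` index pairs, and the choice to measure each bone's angle against the $z$-axis are all assumptions made for illustration.

```python
import numpy as np


def motion_orders(joints: np.ndarray, dt: float = 1.0):
    """Finite-difference velocity and acceleration of joint coordinates.

    joints: array of shape (T, V, 3) = (frames, joints, xyz). Assumed layout.
    Returns (velocity, acceleration), each of shape (T, V, 3).
    """
    # Central differences in the interior, one-sided differences at the ends.
    velocity = np.gradient(joints, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    return velocity, acceleration


def bone_angle_orders(joints: np.ndarray, bones, dt: float = 1.0, eps: float = 1e-8):
    """Angle, angular velocity and angular acceleration of each bone.

    Assumption: the angle is measured between the bone vector and the z-axis.
    The paper's own definition of the instantaneous angle may differ.
    bones: list of hypothetical (child, parent) joint-index pairs.
    Returns (theta, omega, alpha), each of shape (T, len(bones)).
    """
    child = [c for c, _ in bones]
    parent = [p for _, p in bones]
    bone_vec = joints[:, child] - joints[:, parent]        # (T, B, 3)
    norm = np.maximum(np.linalg.norm(bone_vec, axis=-1), eps)
    cos_theta = np.clip(bone_vec[..., 2] / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)                           # (T, B)
    omega = np.gradient(theta, dt, axis=0)                 # first order
    alpha = np.gradient(omega, dt, axis=0)                 # second order
    return theta, omega, alpha
```

Each returned array could then serve as a separate input stream alongside the raw joint coordinates. The frames at the start and end of the sequence use one-sided differences, so their second-order values are less accurate than the interior ones.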

Paper Structure (12 sections, 12 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of similar actions and their features at different orders. Left: two videos of similar actions, "drinking water" and "brushing teeth"; the key frames are the clips that are decisive for identifying the action. Right: taking the hand joint as an example, the curves show the trajectories of motion characteristics of various orders along the $z$-dimension.
  • Figure 2: Visualization of the instantaneous angle $\theta_i^p$ and the instantaneous angular velocity $\omega_{i}^p$.
  • Figure 3: The overall pipeline of our Stream-GCN network. The goal is to identify the action class from the motion sequence. For clarity of illustration, we show only one layer of spatial and temporal modeling in this figure. We first take input streams from the lower-order and higher-order representations. For each stream, the spatial modeling applies a channel attention module that consists of a set of pooling and convolution operations, yielding features with channel weights. The temporal modeling adopts multi-scale convolutions, capturing long-range temporal dependencies within a motion sequence. For multi-stream fusion, each stream predicts an action class distribution, and these distributions are ensembled to obtain the final class distribution.
  • Figure 4: Our mutual information objective maximizes $I(Y;Z)$, compresses $I(Z;X)$, and preserves $I(Z;Y|X)$. $H(\cdot)$ denotes entropy. The visualization is inspired by Yeung (1991).
  • Figure 5: Examples of the learned attention maps at different layers for the "drinking water" action. The numbers denote different joints, e.g., "number 4" denotes "head" and "number 8" denotes "left hand". Brighter areas indicate larger weights in the correlation matrix, i.e., stronger correlation between the corresponding joints.
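
The Figure 3 caption states that each stream predicts an action class distribution and that these are ensembled into the final distribution. A minimal sketch of such score-level fusion is given below. The helper names `softmax` and `ensemble_streams`, the optional per-stream `weights`, and the use of a weighted average of softmax probabilities are assumptions for illustration; the source does not specify the exact fusion rule or weights.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def ensemble_streams(stream_logits, weights=None):
    """Fuse per-stream class scores into one class distribution.

    stream_logits: list of S arrays, each of shape (N, C) = (samples, classes).
    weights: optional length-S stream weights (hypothetical; uniform if None).
    Returns (fused_probs of shape (N, C), predicted class indices of shape (N,)).
    """
    probs = np.stack([softmax(l) for l in stream_logits])   # (S, N, C)
    if weights is None:
        weights = np.ones(len(stream_logits))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                        # normalize to sum 1
    fused = np.tensordot(weights, probs, axes=1)             # (N, C)
    return fused, fused.argmax(axis=-1)


# Usage: two streams that disagree on a single sample with three classes.
joint_stream = np.array([[2.0, 0.0, 0.0]])
accel_stream = np.array([[0.0, 3.0, 0.0]])
fused, pred = ensemble_streams([joint_stream, accel_stream])
```

Averaging probabilities rather than raw logits keeps every stream on the same scale, so a stream with unusually large logits cannot dominate the result purely because of its magnitude.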