EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

Ahmed Abdelkawy, Asem Ali, Aly Farag

TL;DR

The paper tackles real-time multimodal human action recognition by combining RGB and skeleton modalities through an efficient architecture. It introduces X-ShiftNet, which embeds Temporal Shift Modules into a compact 2D CNN backbone to approximate 3D convolutions while drastically reducing FLOPs and parameters. A pose-driven spatial-temporal attention block aligns skeleton cues with RGB frames, enabling selective weighting of keyframes and spatial regions; the predictions of the RGB and pose streams are then fused. On NTU RGB-D 60/120, PKU-MMD, and Toyota Smarthome, EPAM-Net achieves competitive accuracy while delivering substantial computational savings (up to 72.8× fewer FLOPs and 48.6× fewer parameters), demonstrating strong potential for real-time deployment in activity recognition tasks.
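
To make the final fusion step concrete, here is a minimal sketch of late score fusion between two streams. The summary only states that the predictions are fused; the softmax averaging, the function names `softmax` and `fuse_scores`, and the `rgb_weight` parameter are illustrative assumptions, not the authors' confirmed procedure.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(rgb_logits, pose_logits, rgb_weight=0.5):
    """Weighted average of per-class probabilities from the two streams.

    rgb_weight is a hypothetical knob; 0.5 gives both streams equal say.
    """
    rgb_probs = softmax(rgb_logits)
    pose_probs = softmax(pose_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * p
        for r, p in zip(rgb_probs, pose_probs)
    ]


# Usage: three classes, the streams disagree on the top class.
fused = fuse_scores([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale, so a stream with larger logit magnitudes does not dominate by accident.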

Abstract

Existing multimodal-based human action recognition approaches are computationally intensive, limiting their deployment in real-time applications. In this work, we present a novel and efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we propose eXpand temporal Shift (X-ShiftNet) convolutional architectures for RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. The X-ShiftNet tackles the high computational cost of the 3D CNNs by integrating the Temporal Shift Module (TSM) into an efficient 2D CNN, enabling efficient spatiotemporal learning. Then skeleton features are utilized to guide the visual network stream, focusing on keyframes and their salient spatial regions using the proposed spatial-temporal attention block. Finally, the predictions of the two streams are fused for final classification. The experimental results show that our method, with a significant reduction in floating-point operations (FLOPs), outperforms and competes with the state-of-the-art methods on NTU RGB-D 60, NTU RGB-D 120, PKU-MMD, and Toyota SmartHome datasets. The proposed EPAM-Net provides up to a 72.8x reduction in FLOPs and up to a 48.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-Action-Recognition.
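
The temporal shift idea that the abstract builds on can be sketched in a few lines of NumPy. This follows the general Temporal Shift Module scheme (move one slice of channels one frame back, another slice one frame forward, leave the rest in place); the function name `temporal_shift`, the `(T, C, H, W)` layout, and the `fold_div=8` default are assumptions for illustration, and the sketch does not reproduce X-ShiftNet itself.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis.

    x has shape (T, C, H, W). One fold of C // fold_div channels receives
    features from the next frame, a second fold receives features from the
    previous frame, and the remaining channels are left untouched.
    Frames at the clip boundary are zero-padded.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    # Fold 1: each frame sees the following frame's features.
    out[:-1, :fold] = x[1:, :fold]
    # Fold 2: each frame sees the preceding frame's features.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels: no temporal mixing.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Usage: 4 frames, 8 channels, 1x1 spatial map, shifting 2 + 2 channels.
clip = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
shifted = temporal_shift(clip, fold_div=4)
```

The shift itself costs no multiplications, so a 2D convolution applied afterwards mixes information across neighbouring frames at essentially the cost of a plain 2D network. That is the mechanism behind the efficiency argument in the abstract.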

Paper Structure (21 sections, 6 equations, 4 figures, 9 tables)

This paper contains 21 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Examples of action pairs that are challenging to differentiate using the skeleton modality in the Toyota-Smarthome dataset (left) and in the NTU RGB+D dataset (middle and right), where the skeleton heatmaps of each pair of actions (e.g., reading-writing) are similar.
  • Figure 2: The EPAM-Net architecture consists of visual and pose backbones that extract spatial-temporal features from RGB videos and pose sequences, respectively; a pose-driven spatial-temporal attention block that re-weights the visual features accordingly; two classification heads; and final score fusion. The input of the pose network stream is a pseudo-heatmap volume built from N uniformly sampled frames, while the input of the visual network stream consists of M frames selected from these N frames by taking one out of every $\frac{N}{M}$ frames. $f_s$ and $f_r$ denote the skeleton features and visual features, respectively.
  • Figure 3: Illustration of the proposed spatial-temporal attention block. A spatial attention map weights discriminative spatial regions, while a temporal attention map weights keyframes.
  • Figure 4: Per-action classification accuracy for the 10 most challenging actions under the cross-subject protocol of the NTU RGB+D 120 (a) and Toyota-Smarthome (b) datasets.
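
The frame selection described for Figure 2 and the re-weighting described for Figure 3 can be illustrated with a short sketch. It assumes that N is divisible by M and that the attention maps are applied by element-wise multiplication; the function names and tensor layouts are hypothetical, and how the maps are computed from the skeleton features $f_s$ is not specified in these captions, so they are simply passed in as inputs.

```python
import numpy as np


def select_visual_frames(pose_frame_ids, m):
    """Pick M of the N pose-stream frames, one out of every N / M."""
    n = len(pose_frame_ids)
    if m <= 0 or n % m != 0:
        raise ValueError("this sketch assumes N is a positive multiple of M")
    return pose_frame_ids[:: n // m]


def apply_attention(f_r, spatial_map, temporal_weights):
    """Re-weight visual features with spatial and temporal attention.

    f_r:              visual features, shape (M, C, H, W)
    spatial_map:      per-frame spatial weights, shape (M, H, W)
    temporal_weights: per-frame weights, shape (M,)
    Multiplicative re-weighting is an assumption made for illustration.
    """
    spatial = spatial_map[:, None, :, :]              # broadcast over channels
    temporal = temporal_weights[:, None, None, None]  # broadcast over C, H, W
    return f_r * spatial * temporal


# Usage: N = 32 pose frames, M = 8 visual frames.
visual_ids = select_visual_frames(list(range(32)), 8)

features = np.ones((2, 3, 2, 2))
spatial_map = np.array([[[1.0, 0.0], [0.0, 1.0]],
                        [[0.5, 0.5], [0.5, 0.5]]])
temporal_weights = np.array([1.0, 2.0])
reweighted = apply_attention(features, spatial_map, temporal_weights)
```

Because the visual frames are a strided subset of the pose frames, every RGB frame has a pose heatmap for the same time step, which is what lets skeleton cues guide the visual stream frame by frame.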