Heatmap Pooling Network for Action Recognition from RGB Videos

Mengyuan Liu, Jinfu Liu, Yongkang Jiang, Bin He

TL;DR

The paper tackles action recognition from RGB videos by addressing information redundancy and noise in heatmap representations. It introduces HP-Net, a heatmap-pooled, pose-guided framework that combines a Feedback Pooling Module, lightweight graph-based topology modeling, and multimodal fusion via Spatial-Motion Co-learning and Text Refinement Modulation. The approach reports state-of-the-art results on four benchmarks (NTU-60, NTU-120, Toyota-Smarthome, UAV-Human) and shows strong transferability and robustness, including when the pooled features are combined with RGB and text modalities. The code is publicly available, which supports practical adoption and extension to other action-recognition datasets and tasks.

Abstract

Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust, and concise pooled features of the human body through a feedback pooling module. The extracted pooled features show clear performance advantages over the pose data and heatmap features previously obtained from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.

Paper Structure

This paper contains 25 sections, 13 equations, 9 figures, 18 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between existing methods and ours. Existing methods for modeling features from RGB videos have several limitations. (a) RGB-based methods rely on RGB images that are prone to environmental interference and noise. (b) Pose-based methods use pose data with limited information, making it difficult to support fine-grained action recognition. (c) Pose-guided feature aggregation methods require additional modeling and fusion networks, resulting in higher computational cost. (d) Heatmap-based methods often involve redundant data and incur high storage costs. (e) Our proposed HP-Net incorporates a Feedback Pooling Module (FPM) to obtain more efficient heatmap representations by reducing redundancy while retaining relevant information. In addition, the Text Refinement Modulation Module (TRMM) is used to refine the visual features with textual guidance, helping to further improve representation quality.
  • Figure 2: Framework of our proposed Heatmap Pooling Network (HP-Net). We utilize a feedback pooling module (FPM) to extract information-rich, robust and concise heatmap pooled features from videos. The spatial-motion co-learning module (SMCLM) is designed to model spatial and motion dynamics within heatmap pooled features, enabling the generation of spatiotemporal representations that seamlessly integrate with other modalities. The text refinement modulation module (TRMM) aims to enhance and modulate the basic text features of action labels, transforming them into enriched semantic representations that effectively support human action recognition.
  • Figure 3: Visualization of human heatmaps obtained by different human pose estimation models. (a) Visualization of the video frame sequence. (b) Visualization of human heatmaps obtained using ResNet [xiao2018simple]. (c) Visualization of human heatmaps obtained using HR-Net [wang2020deep]. (d) Visualization of human heatmaps obtained using SimCC [li2022simcc].
  • Figure 4: Visualization of different action samples. In each subplot, the first column represents the visualization of the 2D pose data obtained from human pose estimation on the original RGB video frame. The second, third and fourth columns correspond to the human heatmaps $H_1$, $H_2$ and $H_3$, respectively. The fifth column shows the heatmap pooled features output by our Feedback Pooling Module (FPM).
  • Figure 5: Visualization of the recognition accuracy for some action categories. (a) The top-8 action categories where the use of heatmap pooled features outperforms 2D pose in the NTU 120 dataset. (b) The top-8 action categories where the use of heatmap pooled features outperforms 2D pose in the Toyota-Smarthome dataset. (c) The top-8 action categories where the use of heatmap pooled features outperforms 2D pose in the UAV-Human dataset.
  • ...and 4 more figures
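
The captions above describe pooling body features under the guidance of per-joint heatmaps. As a rough illustration of that general idea only, the sketch below implements generic heatmap-weighted average pooling in NumPy: each joint's heatmap is normalized into spatial weights and used to average a feature map. This is not the paper's Feedback Pooling Module, whose exact formulation is not given in this summary; the function name `heatmap_weighted_pool`, the tensor shapes, and the `eps` parameter are assumptions made for the example.

```python
import numpy as np


def heatmap_weighted_pool(features, heatmaps, eps=1e-8):
    """Generic heatmap-weighted pooling (illustrative, not the paper's FPM).

    features: array of shape (C, H, W), a per-frame feature map.
    heatmaps: array of shape (J, H, W), one non-negative heatmap per joint.
    Returns an array of shape (J, C): one pooled feature vector per joint.
    """
    C, H, W = features.shape
    J = heatmaps.shape[0]
    assert heatmaps.shape[1:] == (H, W), "spatial sizes must match"

    # Normalize each heatmap so its weights sum to (approximately) one.
    weights = heatmaps.reshape(J, H * W)
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)

    # Weighted average over spatial positions: (J, HW) @ (HW, C) -> (J, C).
    return weights @ features.reshape(C, H * W).T


# Toy inputs (hypothetical sizes): 8 channels, 3 joints, 16x16 resolution.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 16))
heatmaps = np.zeros((3, 16, 16))
heatmaps[0, 4, 5] = 1.0   # joint 0: a single peak
heatmaps[1, 10, 2] = 1.0  # joint 1: a single peak elsewhere
heatmaps[2] = 1.0         # joint 2: uniform heatmap

pooled = heatmap_weighted_pool(features, heatmaps)
```

With a single-peak heatmap the pooled vector reduces to the feature at the peak location, and with a uniform heatmap it reduces to global average pooling. The result is one C-dimensional vector per joint instead of a full H x W map, which is the kind of compact, pose-aligned representation the captions contrast with raw heatmaps.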