Joint Temporal Pooling for Improving Skeleton-based Action Recognition

Shanaka Ramesh Gunasekara, Wanqing Li, Jack Yang, Philip Ogunbona

TL;DR

The paper addresses the loss of discriminative motion information in standard temporal pooling for skeleton-based action recognition. It introduces Joint Motion Adaptive Temporal Pooling (JMAP), which adaptively defines pooling windows using joint motion intensities and supports frame-wise or joint-wise pooling to better capture cross-space-time and cross-joint dependencies. Through joint motion intensity measurement, active joint selection, and a learned pooling matrix, JMAP preserves motion-rich frames and segments. It yields consistent improvements across backbones on NTU RGB+D 120 and PKU-MMD, and competitive results against state-of-the-art methods. The approach is a practical, low-overhead enhancement that can be integrated into existing GCN-based skeleton action recognition pipelines, improving performance on ambiguous actions and overall robustness.

Abstract

In skeleton-based human action recognition, temporal pooling is a critical step for capturing the spatiotemporal relationships of joint dynamics. Conventional pooling methods treat each frame equally and do not preserve motion information. However, in an action sequence, only a few segments of frames carry discriminative information related to the action. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. Two variants of JMAP, frame-wise pooling and joint-wise pooling, are introduced. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: In a sequence of the Drinking water action, the person does not perform the action until f=38. They start at f=40 and stop at f=64, followed by a few frames that are largely unrelated to the action. Conventional pooling methods apply a constant pooling window size across the sequence. The proposed joint motion intensity adaptive pooling module instead adapts the pooling window to the motion information, generating the windows shown in the figure (red boxes): wider windows for static segments and narrower windows for dynamic segments.
  • Figure 2: The pipeline for joint motion adaptive pooling module
  • Figure 3: Accuracy change between CTR-GCN and CTR-GCN with JMAP for the hard and extreme hard action groups.
  • Figure 4: Cumulative Joint Motion Intensity (CJMI) curves with different normalization functions. The CJMI with $\tanh$ gives the best curvature with high motion intensities, while the remaining functions generate curves more closely aligned with uniform pooling (red dashed line).
  • Figure 5: CJMI curves obtained using all joints and active joints. The CJMI with active joints shows better curvature than the CJMI with all joints. The red dashed line corresponds to uniform sampling.
  • ...and 1 more figure
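
The idea described in the captions for Figures 1, 4, and 5 (wide windows over static segments, narrow windows over dynamic ones, driven by a cumulative motion intensity curve) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper's equations are not reproduced in this summary, so the displacement-norm intensity, the top-k rule for active joints, the tanh scaling, the small `floor` term, and the equal-mass split of the cumulative curve are all assumptions. The function names are hypothetical, and plain averaging stands in for the learned pooling matrix.

```python
import numpy as np


def joint_motion_intensity(x):
    """x: (T, V, C) joint coordinates -> (T, V) per-joint displacement magnitude.

    Assumption: intensity is the norm of the frame-to-frame displacement.
    """
    diff = np.diff(x, axis=0)                       # (T-1, V, C)
    jmi = np.linalg.norm(diff, axis=-1)             # (T-1, V)
    pad = np.zeros((1, x.shape[1]))                 # frame 0 has no predecessor
    return np.concatenate([pad, jmi], axis=0)       # (T, V)


def adaptive_window_edges(x, num_windows, active_ratio=0.5, floor=0.05, eps=1e-8):
    """Return num_windows + 1 frame indices delimiting the pooling windows."""
    T, V = x.shape[0], x.shape[1]
    if T < num_windows:
        raise ValueError("need at least one frame per window")

    jmi = joint_motion_intensity(x)
    # Assumed active-joint rule: keep the k joints with the most total motion.
    k = max(1, int(round(active_ratio * V)))
    active = np.argsort(jmi.sum(axis=0))[-k:]
    frame_intensity = jmi[:, active].mean(axis=1)   # (T,)

    # tanh normalization (scaling by the mean is an assumption); the floor
    # keeps the cumulative curve strictly increasing over static frames.
    weights = np.tanh(frame_intensity / (frame_intensity.mean() + eps)) + floor
    cjmi = np.cumsum(weights)
    cjmi = cjmi / cjmi[-1]                          # normalized to end at 1

    # Split the cumulative curve into equal-mass segments.
    levels = np.linspace(0.0, 1.0, num_windows + 1)[1:-1]
    inner = np.searchsorted(cjmi, levels) + 1       # window ends after that frame
    edges = np.concatenate([[0], inner, [T]]).astype(int)

    # Guarantee every window contains at least one frame.
    for i in range(1, num_windows):
        lo = edges[i - 1] + 1
        hi = T - (num_windows - i)
        edges[i] = min(max(edges[i], lo), hi)
    return edges


def adaptive_temporal_pool(features, edges):
    """Average features (T, ...) inside each window; a stand-in for learned pooling."""
    return np.stack([features[s:e].mean(axis=0)
                     for s, e in zip(edges[:-1], edges[1:])])


# Toy sequence: static, then one joint moves during frames 40..55, then static.
x = np.zeros((64, 5, 3))
x[40:56, 0, 0] = np.arange(1, 17)
x[56:, 0, 0] = 16.0
edges = adaptive_window_edges(x, num_windows=8)
pooled = adaptive_temporal_pool(x, edges)
print(np.diff(edges))  # window widths, in frames
```

In this toy sequence the long static prefix falls into a single wide window, while the moving segment is covered by several short windows, which is the qualitative behaviour the Figure 1 caption describes.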