Joint Temporal Pooling for Improving Skeleton-based Action Recognition
Shanaka Ramesh Gunasekara, Wanqing Li, Jack Yang, Philip Ogunbona
TL;DR
The paper addresses the loss of discriminative motion information in standard temporal pooling for skeleton-based action recognition. It introduces Joint Motion Adaptive Temporal Pooling (JMAP), which defines pooling windows adaptively from joint motion intensities and supports either frame-wise or joint-wise pooling to better capture cross-space-time and cross-joint dependencies. JMAP combines joint motion intensity measurement, active joint selection, and a learned pooling matrix to preserve motion-rich frames and segments. It yields consistent improvements across backbones on NTU RGB+D 120 and PKU-MMD and is competitive with state-of-the-art methods. The approach is a practical, low-overhead enhancement that can be integrated into existing GCN-based skeleton action recognition pipelines, improving performance on ambiguous actions as well as overall robustness.
Abstract
In skeleton-based human action recognition, temporal pooling is a critical step for capturing the spatiotemporal relationships of joint dynamics. Conventional pooling methods treat every frame equally and make no effort to preserve motion information. In an action sequence, however, only a few segments of frames carry information that discriminates the action. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. Two variants of JMAP, frame-wise pooling and joint-wise pooling, are introduced. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.
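To make the contrast with equal-weight pooling concrete, the following NumPy sketch weights the frames of each temporal window by a joint motion score instead of averaging them uniformly. It is an illustrative simplification and not the authors' JMAP: it keeps fixed-size windows and substitutes a softmax over a hand-crafted score for the learned pooling matrix. The function names, the top-k rule for picking active joints, and the temperature `tau` are all assumptions introduced here.

```python
import numpy as np

def joint_motion_intensity(x):
    """x: (T, V, C) joint coordinates -> (T, V) per-joint displacement magnitude.

    Assumption: motion intensity is the Euclidean frame-to-frame displacement;
    the first frame has no predecessor and is given zero motion.
    """
    disp = np.linalg.norm(np.diff(x, axis=0), axis=-1)      # (T-1, V)
    return np.vstack([np.zeros((1, x.shape[1])), disp])     # (T, V)

def frame_scores(intensity, k):
    """Sum the intensity of the k most active joints in each frame -> (T,).

    Assumption: 'active' joints are those with the largest total motion
    over the whole sequence (a hypothetical selection rule).
    """
    active = np.argsort(intensity.sum(axis=0))[-k:]
    return intensity[:, active].sum(axis=1)

def motion_weighted_pool(features, scores, window, tau=1.0):
    """Pool features (T, V, D) over non-overlapping windows of `window` frames.

    Within each window, frames are weighted by softmax(scores / tau), so
    motion-rich frames dominate. With equal scores this is average pooling.
    """
    T = features.shape[0]
    if T % window != 0:
        raise ValueError("T must be divisible by window")
    n = T // window
    s = scores.reshape(n, window) / tau
    s = s - s.max(axis=1, keepdims=True)                    # numerical stability
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)                    # (n, window)
    f = features.reshape(n, window, *features.shape[1:])    # (n, window, V, D)
    return (w[:, :, None, None] * f).sum(axis=1)            # (n, V, D)

# Usage: 8 frames, 3 joints; joint 0 jumps between frames 4 and 5.
rng = np.random.default_rng(0)
coords = np.zeros((8, 3, 3))
coords[5:, 0, 0] = 10.0
feats = rng.normal(size=(8, 3, 4))

scores = frame_scores(joint_motion_intensity(coords), k=1)
pooled = motion_weighted_pool(feats, scores, window=4, tau=0.1)
```

In this toy sequence the second window is dominated by frame 5, the only frame with motion, whereas average pooling would blend it equally with its three static neighbours. A joint-wise variant could be sketched the same way by keeping the `(T, V)` intensity as a per-joint score rather than collapsing it to one score per frame.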
