Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition
Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu
TL;DR
This work tackles ambiguity in skeleton-based action recognition by identifying an imbalance between the spatial and temporal features produced by conventional serial GCN-TCN pipelines. It introduces SF-Head, a lightweight plug-in inserted between the GCN and TCN layers that combines Synchronized Spatial-Temporal Extraction (SSTE) with Adaptive Cross-Dimensional Feature Aggregation (AC-FA), guided by two losses: Feature Redundancy Loss (F-RL) and Feature Consistency Loss (F-CL). Training jointly optimizes cross-entropy with these two regularizers, giving the total loss ${\mathcal L}_{total} = {\mathcal L}_{CE} + \lambda_{con}{\mathcal L}_{con} + \lambda_{red}{\mathcal L}_{red}$. Empirically, SF-Head improves ambiguous-action discrimination on NTU RGB+D 60/120, NW-UCLA, and PKU-MMD I with negligible parameter overhead ($<0.01$M) and no inference cost, and it generalizes across diverse backbones.
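As a rough illustration of how the total loss above combines its three terms, the following minimal sketch computes a single-sample cross-entropy and adds the two weighted regularizers. The function names, the default weights of 0.1, and the numeric values passed for ${\mathcal L}_{con}$ and ${\mathcal L}_{red}$ are placeholders chosen for illustration; the summary does not define how F-CL and F-RL are computed, so they are treated here as already-computed scalars.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw class scores.

    Uses the log-sum-exp trick (subtracting the max) for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def total_loss(l_ce, l_con, l_red, lambda_con=0.1, lambda_red=0.1):
    """L_total = L_CE + lambda_con * L_con + lambda_red * L_red.

    lambda_con and lambda_red are hypothetical default weights,
    not values reported by the authors.
    """
    return l_ce + lambda_con * l_con + lambda_red * l_red


# Illustrative inputs only: three class scores and placeholder regularizer values.
logits = [2.0, 0.5, -1.0]
l_ce = cross_entropy(logits, target=0)
loss = total_loss(l_ce, l_con=0.4, l_red=0.2)
print(f"L_CE={l_ce:.4f}  L_total={loss:.4f}")
```

Because the regularizers enter only as weighted additive terms, setting both weights to zero recovers plain cross-entropy training.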
Abstract
Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs in which spatial and temporal features are extracted independently. This leads to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details and lose crucial global context, which makes ambiguous actions even harder to tell apart. To address these challenges, we propose a lightweight plug-and-play module called SF-Head, inserted between the GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between spatial and temporal features. It then performs Adaptive Cross-Dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with their original spatial-temporal features. Experimental results on the NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and PKU-MMD I datasets demonstrate significant improvements in distinguishing ambiguous actions.
