Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu

TL;DR

This work tackles ambiguity in skeleton-based action recognition by identifying imbalances in the spatial-temporal features produced by traditional serial GCN-TCN pipelines. It introduces SF-Head, a lightweight plug-in module inserted between GCN and TCN that combines Synchronized Spatial-Temporal Extraction (SSTE) with Adaptive Cross-Dimensional Feature Aggregation (AC-FA), guided by two losses: Feature Redundancy Loss (F-RL) and Feature Consistency Loss (F-CL). The training objective jointly optimizes cross-entropy with these two regularizers, giving the total loss $\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_{con}\mathcal{L}_{con} + \lambda_{red}\mathcal{L}_{red}$, where $\mathcal{L}_{con}$ is the F-CL term and $\mathcal{L}_{red}$ is the F-RL term. Empirically, SF-Head improves ambiguous-action discrimination across NTU RGB+D 60/120, NW-UCLA, and PKU-MMD I, with negligible parameter overhead (fewer than $0.01$M parameters) and no additional inference cost, demonstrating practical value and generalizability to diverse backbones.
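The total loss above is a plain weighted sum, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `total_loss`, the default weights of 0.1, and the sample loss values are hypothetical and do not come from the paper, and the consistency and redundancy terms are treated as already-computed scalars because their definitions are not given in this summary.

```python
def total_loss(ce, con, red, lambda_con=0.1, lambda_red=0.1):
    """Weighted sum L_total = L_CE + lambda_con * L_con + lambda_red * L_red.

    ce  -- cross-entropy loss for the batch (scalar)
    con -- feature consistency loss, F-CL (scalar, assumed precomputed)
    red -- feature redundancy loss, F-RL (scalar, assumed precomputed)
    The default weights are placeholders, not values reported in the paper.
    """
    return ce + lambda_con * con + lambda_red * red


# Hypothetical per-batch loss values, for illustration only.
loss = total_loss(ce=1.20, con=0.50, red=0.30)
print(f"total loss: {loss:.4f}")
```

Setting both weights to zero recovers plain cross-entropy training, which is one way to check that the regularizers are wired in as optional additions to the objective.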

Abstract

Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs, in which spatial and temporal features are extracted independently. This leads to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details and lose crucial global context, which makes ambiguous actions even harder to tell apart. To address these challenges, we propose a lightweight plug-and-play module called SF-Head, inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between spatial and temporal features. It then performs Adaptive Cross-Dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with the original spatial-temporal features. Experimental results on the NTU RGB+D 60, NTU RGB+D 120, NW-UCLA and PKU-MMD I datasets demonstrate significant improvements in distinguishing ambiguous actions.
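The plug-and-play placement described above, with a head inserted between the GCN and TCN stages, can be sketched as simple function composition. This is a wiring illustration only, not the authors' implementation: `make_block` is a hypothetical helper, features are plain Python lists rather than tensors, and the three toy layers are arbitrary stand-ins that say nothing about what SSTE or AC-FA actually compute.

```python
from typing import Callable, Optional, Sequence

Feature = Sequence[float]              # stand-in for a skeleton feature tensor
Layer = Callable[[Feature], Feature]


def make_block(gcn: Layer, tcn: Layer, head: Optional[Layer] = None) -> Layer:
    """Compose one block as GCN -> (optional head) -> TCN."""
    def block(x: Feature) -> Feature:
        x = gcn(x)                     # spatial stage
        if head is not None:
            x = head(x)                # plug-in stage between GCN and TCN
        return tcn(x)                  # temporal stage
    return block


# Toy stand-in layers (hypothetical, chosen only so the effect is visible).
gcn = lambda x: [v + 1.0 for v in x]
tcn = lambda x: [v * 2.0 for v in x]
head = lambda x: [v - 0.5 for v in x]

baseline = make_block(gcn, tcn)            # serial GCN-TCN pipeline
with_head = make_block(gcn, tcn, head)     # same pipeline with the head inserted

print(baseline([1.0, 2.0]))    # [4.0, 6.0]
print(with_head([1.0, 2.0]))   # [3.0, 5.0]
```

Because the head is an optional argument, the backbone's GCN and TCN stages are left untouched, which is the property that lets such a module be attached to different backbones.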

Paper Structure

This paper contains 33 sections, 17 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The overall framework of our proposed method. We propose a lightweight, plug-and-play module, SF-Head, inserted between GCN and TCN. It first performs synchronized spatial-temporal decoupling with $\mathcal{L}_{red}$ to balance spatial and temporal features, and then applies the adaptive cross-dimensional feature aggregation (AC-FA) module with $\mathcal{L}_{con}$ to align the refined feature with the channel, temporal and spatial features, combining global context and local details.
  • Figure 2: Adaptive Cross-Dimensional Feature Aggregation (AC-FA) module.
  • Figure 3: Accuracy of 23 ambiguous actions (9 groups) in descending order. Group 1: writing, typing on a keyboard, playing with phone, reading; Group 2: jump up, hopping; Group 3: clapping, rub two hands together, tear up paper; Group 4: take off jacket, wear jacket; Group 5: shake head, nod head / bow; Group 6: standing up, sitting down; Group 7: salute, taking a selfie, brushing hair, hand waving; Group 8: take off a shoe, wear a shoe; Group 9: wear on glasses, take off glasses
  • Figure 4: Adaptive learning matrix of the action "salute" on the NTU RGB+D 60 dataset.
  • Figure 5: Representations of long-term action sequences in the NTU RGB+D 120 X-Set dataset using only the backbone (left), the backbone + our module without F-RCL (middle), and the backbone + our module (right). Each color represents one unique feature.