Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Seungyeon Cho, Tae-Kyun Kim

TL;DR

BHaRNet addresses the challenge of capturing fine-grained hand motions in skeleton-based HAR by introducing a dual-stream architecture with body- and hand-expert branches, augmented by cross-attention and a complementary ensemble loss. The model preserves modality-specific cues through expertized branches while enabling cooperative fusion via cross-attention and a pooling attention mechanism, and extends to RGB-guided multi-modal learning inspired by MMNet. Empirical results on NTU RGB+D 60/120, PKU-MMD, and Northwestern-UCLA demonstrate strong performance with reduced GFLOPs and parameters compared to unified body-hand graphs, notably improving hand-intensive action recognition from $86.4\%$ to $93.0\%$ in certain settings. The approach offers robust, scalable benefits for robotics and human–robot interaction, enabling efficient integration of hand dynamics into accurate action recognition, including multi-modal contexts.

Abstract

Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet these models often blur fine hand details, both because body and hand actions have different characteristics and because subtle features are lost during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies (improving from 86.4% to 93.0% on hand-intensive actions) while using fewer GFLOPs and parameters than the relevant unified methods.
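
As a rough illustration of the joint training described in the abstract, the sketch below adds a cross-entropy term for each expert to a cross-entropy term on their combined prediction. The summed-logit fusion, the weights `w_body`, `w_hand`, and `w_ens`, and all function names are assumptions made for illustration; the abstract does not specify the exact form of the ensemble loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class index."""
    return -log_softmax(logits)[target]


def ensemble_loss(body_logits, hand_logits, target,
                  w_body=1.0, w_hand=1.0, w_ens=1.0):
    """Hypothetical ensemble loss: per-expert terms plus a fused term.

    Assumption: the two experts are fused by summing their logits.
    """
    fused = [b + h for b, h in zip(body_logits, hand_logits)]
    return (w_body * cross_entropy(body_logits, target)
            + w_hand * cross_entropy(hand_logits, target)
            + w_ens * cross_entropy(fused, target))


# Toy example with 3 classes: the body expert is undecided,
# while the hand expert is confident in class 0.
loss_uninformed = ensemble_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], target=0)
loss_hand_helps = ensemble_loss([0.0, 0.0, 0.0], [5.0, 0.0, 0.0], target=0)
print(loss_uninformed, loss_hand_helps)
```

In this toy case the fused term falls as soon as one expert becomes confident in the correct class, which is the kind of complementary behavior the abstract likens to a Mixture-of-Experts.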

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Performance vs. FLOPs in skeleton-based action recognition (left) and multi-modal action recognition (right). Our models (red stars) show competitive accuracy at lower GFLOPs than existing SOTA methods in both tasks. MMNet* is the multi-modal backbone of our model, with ResNet18.
  • Figure 2: Overview of our proposed action recognition model. The model is a dual-stream network consisting of Body and Hand expert models, utilizing GCN backbones and cross-attention for enhanced feature fusion.
  • Figure 3: Shortcut of our Pooling Attention Module, which simplifies the cross-attention mechanism.
  • Figure 4: Pipeline of our Expertized Branch Model (left) and Multi-modal Architecture (right). In the Expertized Branch Model, the solid blue boxes indicate interactive branches and the blue dashed boxes indicate expertized branches. We integrate the Expertized Branch Model for both the Joint and Bone streams, and add an RGB stream with its own training path (bold lines). The RGB branch receives body-joint guidance from the body-expertized branch, focusing the visual feature extractor on relevant spatio-temporal regions.
  • Figure 5: Confusion matrix on hand-oriented actions (10 classes shown for visualization). Ours (BHaRNet-E) outperforms SkeleT by focusing on subtle finger articulations. The classes “Okay sign” and “Victory sign” mentioned in the introduction correspond to indices 70 and 71, respectively, in the matrices.
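
The cross-attention fusion mentioned in the Figure 2 caption can be pictured with a generic scaled dot-product sketch in which hand tokens query body tokens. This is not the paper's Pooling Attention Module or its expertized branch design: it omits learned projections and multiple heads, and the residual fusion in `fuse` and all names are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends over all key tokens (scaled dot-product)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output per query token.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(hand_tokens, body_tokens):
    """Hypothetical fusion: add body context to each hand token."""
    attended = cross_attention(hand_tokens, body_tokens, body_tokens)
    return [[h + a for h, a in zip(ht, at)]
            for ht, at in zip(hand_tokens, attended)]


# Toy example: one 2-D hand token attending over two 2-D body tokens.
hand = [[0.0, 0.0]]
body = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(hand, body)
print(fused)
```

With an all-zero query, both body tokens receive equal attention weight, so the fused hand token is simply the mean of the body tokens.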