Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Mengyuan Liu, Chen Chen, Songtao Wu, Fanyang Meng, Hong Liu

TL;DR

The paper tackles skeleton-based interactive action recognition by addressing a limitation of split-and-fusion GCNs: they overlook the mutual relationships between paired entities. It introduces me-GCN, which stacks me-GC layers containing mutual topology excitation (MTE) and mutual feature excitation (MFE) modules to enable cross-entity information exchange at both the topology and feature levels. Empirical results on Assembly101, NTU60-Interaction, and NTU120-Interaction show state-of-the-art performance, with notable gains over strong baselines and ablations confirming the value of mutual learning. The approach offers a principled, efficient way to model mutual semantic cues in interactive actions, with potential for further integration with Transformer-based architectures.

Abstract

Recognizing interactive actions, including hand-to-hand interaction and human-to-human interaction, has attracted increasing attention for various applications in the field of video analysis and human-robot interaction. Considering the success of graph convolution in modeling topology-aware features from skeleton data, recent methods commonly apply graph convolution to separate entities and use late fusion for interactive action recognition, which can barely model the mutual semantic relationships between pairwise entities. To this end, we propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution (me-GC) layers. Specifically, me-GC uses a mutual topology excitation module to first extract adjacency matrices from individual entities and then adaptively model the mutual constraints between them. Moreover, me-GC extends the above idea and further uses a mutual feature excitation module to extract and merge deep features from pairwise entities. Compared with graph convolution, our proposed me-GC gradually learns mutual information in each layer and each stage of graph convolution operations. Extensive experiments on a challenging hand-to-hand interaction dataset, i.e., the Assembly101 dataset, and two large-scale human-to-human interaction datasets, i.e., NTU60-Interaction and NTU120-Interaction, consistently verify the superiority of our proposed method, which outperforms the state-of-the-art GCN-based and Transformer-based methods.
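
To make the contrast between late fusion and mutual excitation concrete, the following is a minimal NumPy sketch for a single frame with two entities. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the equations, so the similarity-based adjacency, the sigmoid gating, the residual form, and all function names (`split_layer`, `mutual_topology_excitation`, `mutual_feature_excitation`, `me_gc_layer`) are hypothetical choices made here.

```python
import numpy as np

def softmax_rows(z):
    """Numerically stable softmax over each row."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def row_normalize(a):
    """Scale each row of a positive matrix so it sums to 1."""
    return a / a.sum(axis=1, keepdims=True)

def graph_conv(x, a, w):
    """Plain graph convolution with ReLU. x: (V, C_in), a: (V, V), w: (C_in, C_out)."""
    return np.maximum(a @ x @ w, 0.0)

def split_layer(x1, x2, a_prior, w):
    """Split-style baseline: each entity is processed independently."""
    a = row_normalize(a_prior)
    return graph_conv(x1, a, w), graph_conv(x2, a, w)

def mutual_topology_excitation(x1, x2, a_prior):
    """ASSUMED form of MTE: build one adjacency per entity from joint
    similarity, then gate each adjacency with the partner's adjacency."""
    a1 = softmax_rows(x1 @ x1.T)
    a2 = softmax_rows(x2 @ x2.T)
    a1_me = row_normalize(a_prior + a1 * sigmoid(a2))
    a2_me = row_normalize(a_prior + a2 * sigmoid(a1))
    return a1_me, a2_me

def mutual_feature_excitation(h1, h2):
    """ASSUMED form of MFE: channel gates computed from the partner's
    joint-averaged features, applied as a residual excitation."""
    g1 = sigmoid(h2.mean(axis=0, keepdims=True))  # (1, C_out), from entity 2
    g2 = sigmoid(h1.mean(axis=0, keepdims=True))  # (1, C_out), from entity 1
    return h1 + h1 * g1, h2 + h2 * g2

def me_gc_layer(x1, x2, a_prior, w):
    """One illustrative mutual-excitation layer: MTE, graph conv, then MFE."""
    a1, a2 = mutual_topology_excitation(x1, x2, a_prior)
    h1, h2 = graph_conv(x1, a1, w), graph_conv(x2, a2, w)
    return mutual_feature_excitation(h1, h2)

# Toy data: 5 joints per entity on a chain skeleton, 4 input channels.
rng = np.random.default_rng(0)
V, C_in, C_out = 5, 4, 8
a_prior = np.eye(V) + np.eye(V, k=1) + np.eye(V, k=-1)
x1, x2 = rng.random((V, C_in)), rng.random((V, C_in))
w = rng.random((C_in, C_out))

out1, out2 = me_gc_layer(x1, x2, a_prior, w)
print(out1.shape, out2.shape)  # (5, 8) (5, 8)
```

In the baseline, the features of entity 1 are unaffected by entity 2 until a later fusion step, whereas in the sketched layer a change to `x2` alters both the adjacency and the channel gates used for entity 1. Stacking several such layers would let this exchange happen at every depth, which is the property the abstract attributes to me-GC.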

Paper Structure (12 sections, 6 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Illustration of our general idea. To recognize interactive actions, e.g., "high five" and "handshaking" (a), previous GCN-based methods follow a split-and-fusion pipeline (b), which overlooks the mutual semantic relationships between interactive body parts. To solve this problem, our method involves mutual learning for topology excitation and feature excitation (c). Note that only one layer is shown in (b) and (c) for simplicity.
  • Figure 2: Overview of our proposed mutual excitation graph convolutional network (me-GCN), which contains an input layer, $K$ mutual excitation graph convolution (me-GC) layers, and an inference layer. Each me-GC layer contains a mutual topology excitation module (MTE), a mutual feature excitation module (MFE), and two graph convolution operations. FGB and FFB denote feature generation block and feature fusion block respectively, and function $\mathcal{N}$ (see Eq. 4) is used to fuse outputs from FGB.
  • Figure 3: t-SNE (van der Maaten and Hinton, 2008) visualization of the skeleton sequence representations on the test set of the NTU120-Interaction dataset, learned by the Baseline method and different variants of our model: MTE only, MFE only, and Ours. Compared with the Baseline, our full model learns more distinctive representations that differentiate similar interactive actions. Note that both shape and color are used to denote different actions.
  • Figure 4: Effect of mutual learning on the activation score per joint for one entity participating in a "shaking hands" action. We observe that mutual learning increases the activation score of the critical joint, which shows a strong correlation with the action label.
  • Figure 5: Confusion matrix on the NTU120-Interaction dataset under the cross-subject protocol.
  • ...and 1 more figure