Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

Yuhang Wen, Zixuan Tang, Yunsheng Pang, Beichen Ding, Mengyuan Liu

TL;DR

ISTA-Net introduces Interactive Spatiotemporal Tokens (ISTs) to jointly model spatial, temporal, and interactive relations for skeleton-based interactive action recognition without relying on fixed subject priors. It combines 3D-tokenized ISTs with Token Self-Attention (TSA) blocks and adds Entity Rearrangement to enforce permutation invariance across diverse interacting entities. In experiments on NTU Mutual, SBU, H2O, and Assembly101, ISTA-Net achieves state-of-the-art results, and ablation studies validate the proposed components. The work provides a practical, generalizable framework with open-source code for interactive action understanding in human–robot interaction contexts.

Abstract

Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods use late fusion and co-attention mechanisms to capture interactive relations, which have limited learning capability or adapt inefficiently to more interacting entities. Because they assume that the priors of each entity are already known, they also lack evaluation in a more general setting that addresses the diversity of subjects. To address these problems, we propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), which are a unified way to represent the motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along the three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations. When modeling correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. To this end, Entity Rearrangement is proposed to eliminate the orderliness in ISTs for interchangeable entities. Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
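
The two ideas named in the abstract, partitioning a multi-entity skeleton sequence into tokens and removing the entity ordering, can be illustrated with a small pure-Python sketch. This is not the authors' implementation (see the linked repository for that). The function names `tokenize` and `entity_rearrangement`, the nested-list layout `sample[entity][frame][joint]`, the non-overlapping windows, and the choice to let every token span all entities are assumptions made only for illustration. The learned 3D convolutions and attention blocks are omitted.

```python
import random


def entity_rearrangement(sample, rng):
    """Return a copy of `sample` with its entity axis randomly permuted.

    Hypothetical helper: `sample[entity][frame][joint]` holds one joint value.
    """
    order = list(range(len(sample)))
    rng.shuffle(order)
    return [sample[e] for e in order], order


def tokenize(sample, t_win, j_win):
    """Partition a sample into non-overlapping windows (illustrative only).

    Assumption: each token covers all entities, `t_win` frames and
    `j_win` joints, so a token mixes information across entities.
    """
    n_entities = len(sample)
    n_frames = len(sample[0])
    n_joints = len(sample[0][0])
    if n_frames % t_win or n_joints % j_win:
        raise ValueError("window sizes must divide the frame and joint counts")
    tokens = []
    for t0 in range(0, n_frames, t_win):
        for j0 in range(0, n_joints, j_win):
            tokens.append([
                sample[e][t][j]
                for e in range(n_entities)
                for t in range(t0, t0 + t_win)
                for j in range(j0, j0 + j_win)
            ])
    return tokens


# Toy input: 2 entities, 4 frames, 6 joints; each value records its own index.
rng = random.Random(0)
sample = [[[(e, t, j) for j in range(6)] for t in range(4)] for e in range(2)]

shuffled, order = entity_rearrangement(sample, rng)
tokens = tokenize(sample, t_win=2, j_win=3)
shuffled_tokens = tokenize(shuffled, t_win=2, j_win=3)

print(len(tokens), len(tokens[0]))  # 4 tokens, 12 values per token
```

In this toy setup, permuting the entities changes only the order of values inside each token, not which values the token contains. That is the sense in which a model trained on rearranged inputs is encouraged not to rely on a fixed entity order.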

Paper Structure (15 sections, 11 equations, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: Examples of individual actions (a) [NTU120], group activities (b) [cad2009], and interactive actions (c) [NTU120, 8299578, H2O_TA-GCN2021]. (a) A sequence of poses from a single person can fully depict the action Jump Up. (b) The group activity Waiting is annotated regardless of the individual pedestrians. (c) Each entity is an integral part of the interactive action. Previous methods focus on one type of these interactions. (d) In this paper, we evaluate on the general interactive action recognition task, which addresses the diversity of interacting subjects.
  • Figure 2: The overall architecture of the proposed ISTA-Net for skeleton-based general interactive action recognition.
  • Figure 3: Difficulties of interactive action recognition of diverse entities in four datasets.
  • Figure 4: Visualization of the learnt interactive relations restored from the last TSA Block. The attention weights are visualized to illustrate the important body parts involved in recognizing different interactive actions. Specifically, ISTA-Net recognizes the Punch action through attention on the attacker's hands and the victim's limbs. The Hugging action is recognized through attention on the approaching and contacting body parts. The Giving Object action is recognized through attention on the hands.