Improving Skeleton-based Action Recognition with Interactive Object Information

Hao Wen, Ziqian Lu, Fengli Shen, Zhe-Ming Lu, Jialin Cui

TL;DR

This paper addresses a limitation of skeleton-only action recognition: it performs poorly when actions involve object interactions. The authors introduce interactive object nodes and a Spatial Temporal Variable Graph Convolutional Network (ST-VGCN), a variable-graph framework that unifies skeleton joints and object nodes. The framework relies on Class Attribute Fusion, Weighted Node Pooling, and a Node Balance Loss, plus a data-augmentation strategy called Random Node Attack to combat overfitting. The authors create two datasets, NTU RGB+D+Object 60 and JXGC 24, which add more than 2 million object nodes obtained through self-training-based object discovery, and use CLIP-based object attributes to enrich the representation. Experiments on NTU RGB+D 60, NTU RGB+D 120, and JXGC 24 show state-of-the-art performance, especially on actions with human-object interactions. Ablations validate the contribution of each component, suggesting practical value for industrial and real-world action recognition tasks.

Abstract

Human skeleton information, which provides a simple and efficient way to describe human pose, is important in skeleton-based action recognition. However, existing skeleton-based methods focus on the skeleton and ignore the objects that humans interact with, which leads to poor performance on actions that involve object interactions. We propose a new action recognition framework that introduces object nodes to supply the missing interactive object information. We also propose Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing object nodes. Specifically, to validate the role of interactive object information, we use a simple self-training approach to establish a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more than 2 million additional object nodes. At the same time, we design the Variable Graph construction method so that the graph structure can accommodate a variable number of nodes. Additionally, we are the first to explore the overfitting issue introduced by incorporating additional object information, and we propose a VG-based data augmentation method, called Random Node Attack, to address it. Finally, regarding the network structure, we introduce two fusion modules, CAF and WNPool, along with a novel Node Balance Loss, to improve overall performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state-of-the-art on multiple skeleton-based action recognition benchmarks. On NTU RGB+D 60, our method reaches an accuracy of 96.7% on the cross-subject split and 99.2% on the cross-view split.
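
To make the idea of a graph with a variable number of nodes concrete, the following is a minimal sketch and not the authors' construction method. It assumes a common padding-and-masking scheme: each frame is padded to a fixed node count, and a mask marks which slots hold real nodes. The names `build_variable_graph`, `MAX_OBJECTS`, and `anchor_joints`, and the choice to link each object node to a few anchor joints, are hypothetical illustrations.

```python
import numpy as np

NUM_JOINTS = 25   # NTU RGB+D skeletons have 25 joints per person
MAX_OBJECTS = 4   # hypothetical cap on object nodes per frame


def build_variable_graph(joints, objects, skeleton_edges, anchor_joints):
    """Pad one frame to a fixed node count and return (nodes, adjacency, mask).

    joints:         (NUM_JOINTS, C) array of joint coordinates
    objects:        (K, C) array of object-node coordinates, 0 <= K <= MAX_OBJECTS
    skeleton_edges: iterable of (i, j) joint index pairs
    anchor_joints:  joint indices each object node is linked to (an assumption)
    """
    num_objects = len(objects)
    if num_objects > MAX_OBJECTS:
        raise ValueError("too many object nodes for this frame")
    total = NUM_JOINTS + MAX_OBJECTS
    channels = joints.shape[1]

    # Real nodes first, then zero-padded slots for absent objects.
    nodes = np.zeros((total, channels), dtype=np.float32)
    nodes[:NUM_JOINTS] = joints
    nodes[NUM_JOINTS:NUM_JOINTS + num_objects] = objects

    # True where a slot holds a real joint or object node.
    mask = np.zeros(total, dtype=bool)
    mask[:NUM_JOINTS + num_objects] = True

    # Self-loops, symmetric skeleton edges, and object-to-anchor edges.
    adjacency = np.eye(total, dtype=np.float32)
    for i, j in skeleton_edges:
        adjacency[i, j] = adjacency[j, i] = 1.0
    for k in range(num_objects):
        for a in anchor_joints:
            adjacency[NUM_JOINTS + k, a] = adjacency[a, NUM_JOINTS + k] = 1.0

    # Zero out rows and columns of padded slots so they carry no messages.
    adjacency *= mask[:, None] & mask[None, :]
    return nodes, adjacency, mask
```

With this layout, every frame yields arrays of the same shape regardless of how many objects were detected, and the mask lets downstream layers ignore the padded slots.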

Paper Structure

This paper contains 29 sections, 14 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustrating the role of object nodes in aiding the classifier to distinguish actions with similar skeleton poses. (a) "Reading"; (b) "Writing"; (c) "Reading" with the added "Book" node; (d) "Writing" with the added "Pen" node.
  • Figure 2: We perform pose estimation and object detection on the input video to extract skeleton and object nodes. Next, we generate Variable Graph sequences and concatenate the encoded class attributes. Finally, these sequences are fed into ST-VGCN for action classification.
  • Figure 3: An overview of the JXGC 24 dataset. These two category dimensions can constitute three different ways of splitting the dataset.
  • Figure 4: Changes in classification accuracy for categories where object nodes are introduced in NTU RGB+D 60.
  • Figure 5: The impact of the weight of node balance loss on the recognition results in JXGC 24.
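
The step in Figure 2 where encoded class attributes are concatenated to the Variable Graph sequence can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's implementation: `attach_class_attributes` is a hypothetical helper, and `class_embeddings` stands in for whatever encoded attribute vectors (for example, text-encoder outputs per class name) are actually used.

```python
import numpy as np


def attach_class_attributes(nodes, class_ids, class_embeddings):
    """Concatenate a per-class attribute vector to every node's features.

    nodes:            (T, V, C) coordinates over T frames and V nodes
    class_ids:        (V,) integer class index per node (joint or object type)
    class_embeddings: (num_classes, D) table of encoded class attributes
    """
    attrs = class_embeddings[class_ids]                               # (V, D)
    attrs = np.broadcast_to(attrs, (nodes.shape[0],) + attrs.shape)   # (T, V, D)
    return np.concatenate([nodes, attrs], axis=-1)                    # (T, V, C + D)
```

Each node keeps its coordinates and gains a fixed vector describing what kind of node it is, so a "Book" node and a "Pen" node at the same position remain distinguishable to the network.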