Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

Yao Liu, Gangfeng Cui, Jiahui Luo, Xiaojun Chang, Lina Yao

TL;DR

This work tackles privacy-sensitive two-person interaction recognition by converting depth video into point clouds and applying a four-stage pipeline: Interval Frame Sampling (IFS) to select informative frames, a frame features learning module to extract local-region, appearance, and motion cues, a two-stream multi-level feature aggregation module to produce global and partial representations, and a transformer classifier for the final prediction. IFS is introduced to balance information content against processing time, and features are fused at both the global and the partial level to capture broad context as well as fine-grained cues. Experiments on the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120 show that the network outperforms state-of-the-art approaches in most standard evaluation settings. The point-cloud-based framework targets applications where personal privacy matters, such as surveillance and human-computer interaction.

Abstract

As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition underpins many smart applications, in which personal privacy is a strong concern. Recognizing two-person interactions is more challenging than recognizing single-person actions because of increased body occlusion and overlap. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses this challenge by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.

Paper Structure (31 sections, 11 equations, 8 figures, 9 tables, 1 algorithm)

This paper contains 31 sections, 11 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Examples of interactions from the NTU RGB+D 120 dataset: (A) hugging, (B) punching, (C) shaking hands, (D) high-five, (E) kicking, (F) whispering, (G) taking a photo and (H) cheers and drink.
  • Figure 2: Two-stream Multi-level Dynamic Point Transformer comprises four main components: a novel frame sampling scheme named Interval Frame Sampling, a frame features learning module, a two-stream multi-level feature aggregation module, and a transformer classification module.
  • Figure 3: A schematic overview of Interval Frame Sampling for two-person interaction recognition. Starting with an original interaction video, in step 1, $p$ frames are sampled from $p$ intervals to represent the corresponding interaction sample during the whole learning process. In step 2, $q$ frames are further sampled from the $p$ frames for each epoch's interaction learning.
  • Figure 4: A frame features learning module extracts local-region features, a frame spatial feature, and a frame temporal feature from each point cloud frame. $n$ denotes the number of points and $m$ denotes the dimension of the points.
  • Figure 5: First, a two-stream multi-level feature aggregation module merges frame-level features into global and partial features. Then, a transformer classification module performs self-attention on these aggregated features and uses the output combined with the original global feature for interaction recognition.
  • ...and 3 more figures
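
The two-step sampling described in the Figure 3 caption can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the equal-width intervals, the uniform random choice within each interval, and the temporal re-sorting of the per-epoch subset are all assumptions, since the caption states only that $p$ frames come from $p$ intervals and that $q$ of them are drawn for each epoch.

```python
import random


def sample_candidate_frames(num_frames, p, rng=None):
    """Step 1 (hypothetical): pick one frame index from each of p intervals.

    Assumes equal-width, contiguous intervals and a uniform random pick
    inside each one; the source does not specify either choice.
    """
    rng = rng or random.Random()
    if not 0 < p <= num_frames:
        raise ValueError("need 0 < p <= num_frames")
    # Interval i covers frame indices [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // p for i in range(p + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(p)]


def sample_epoch_frames(candidates, q, rng=None):
    """Step 2 (hypothetical): draw q of the p candidates for one epoch.

    The subset is returned in temporal order, which is an assumption.
    """
    rng = rng or random.Random()
    if not 0 < q <= len(candidates):
        raise ValueError("need 0 < q <= len(candidates)")
    return sorted(rng.sample(candidates, q))


if __name__ == "__main__":
    # Candidates are fixed once per video; a fresh subset is drawn per epoch.
    candidates = sample_candidate_frames(num_frames=100, p=10)
    for epoch in range(3):
        print(epoch, sample_epoch_frames(candidates, q=4))
```

Under these assumptions, the $p$ candidates are computed once per video and reused, while the $q$-frame subset changes from epoch to epoch, so the network sees different views of the same interaction at a fixed per-epoch cost.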