Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition
Yao Liu, Gangfeng Cui, Jiahui Luo, Xiaojun Chang, Lina Yao
TL;DR
This work tackles privacy-sensitive two-person interaction recognition by converting depth video into point clouds and applying a four-stage pipeline: Interval Frame Sampling (IFS) to select informative frames; a frame features learning module to extract local-region spatial, appearance, and motion cues; a two-stream multi-level feature aggregation module to produce global and partial representations; and a multi-head transformer classifier for the final prediction. IFS is introduced to balance information content against computational cost, and features are fused at the global and temporal-partial levels to capture both broad context and fine-grained cues. Experiments on the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120 show that the network outperforms state-of-the-art approaches in most standard evaluation settings, with ablations validating the effectiveness of each component. The proposed privacy-preserving, point-cloud-based framework targets real-world surveillance and human-computer interaction tasks.
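As a rough illustration of the first step, converting a depth frame into a point cloud, the sketch below back-projects each valid depth pixel through a standard pinhole camera model. The function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example; this summary does not specify how the authors perform the conversion.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth image into 3D points with a pinhole model.

    depth[v][u] is the depth at pixel row v, column u; values <= 0 are
    treated as missing measurements and skipped (an assumption here).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth reading at this pixel
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Toy 2x2 depth frame with hypothetical unit intrinsics.
frame = [[0.0, 2.0], [1.0, 0.0]]
cloud = depth_to_points(frame, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Only geometry is kept in the resulting point set, which is what makes a depth-derived representation attractive when appearance details such as faces should not be stored.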
Abstract
As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition underpins many smart applications, in which protecting personal privacy is a major concern. However, recognizing two-person interactions is more challenging than recognizing single-person actions because of increased body occlusion and overlap. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses these challenges by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial, appearance, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.
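The abstract does not give the details of Interval Frame Sampling, so the following is only a generic interval-based sampler written to illustrate the general idea: the video is split into equal-length intervals and one frame index is drawn from each, so the selected frames span the whole clip at a fixed cost. The function name `interval_sample`, the choice of the interval midpoint, and the optional random draw are all assumptions for this sketch, not the authors' method.

```python
import random
from typing import List, Optional


def interval_sample(
    num_frames: int,
    num_samples: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick one frame index from each of num_samples equal intervals.

    With rng=None the midpoint of each interval is returned (deterministic);
    with an rng, a random index inside each interval is drawn instead.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    indices: List[int] = []
    for k in range(num_samples):
        start = k * num_frames // num_samples
        end = (k + 1) * num_frames // num_samples
        end = max(end, start + 1)  # keep intervals non-empty for short clips
        if rng is None:
            idx = (start + end - 1) // 2  # midpoint of [start, end)
        else:
            idx = rng.randrange(start, end)
        indices.append(idx)
    return indices


# A 12-frame clip reduced to 4 frames, one per interval of 3 frames.
picked = interval_sample(12, 4)
```

When a clip has fewer frames than requested samples, this sketch repeats indices rather than failing, which keeps the output length fixed for batching.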
