HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition
Jinfu Liu, Baiqiao Yin, Jiaying Lin, Jiajun Wen, Yue Li, Mengyuan Liu
TL;DR
The paper addresses skeleton-based action recognition by moving beyond single-backbone models. It proposes the Hybrid Dual-Branch Network (HDBN), which combines a Graph Convolutional Network branch (MixGCN) with a Transformer-based branch (MixFormer) to jointly exploit local graph structure and global information. HDBN processes four skeleton modalities: joint (J), bone (B), joint motion (JM), and bone motion (BM). It also uses a 3D pose estimator to derive 3D data from 2D inputs, which enables 2D/3D fusion. Key contributions include the use of three GCNs with dynamic adjacency, the Skeleton MixFormer transformer backbone, and a late-fusion ensemble over the branches. On UAV-Human, HDBN achieves 47.95% and 75.36% top-1 accuracy on the CSv1 and CSv2 benchmarks, respectively, outperforming many existing methods and highlighting the practical benefit of backbone complementarity for skeleton-only action recognition.
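The four modalities (J, B, JM, BM) are commonly derived from the raw joint coordinates alone. The following is a minimal sketch of that derivation, not the authors' preprocessing code. It assumes that bones are differences between a joint and its parent and that motion is a frame-to-frame difference. The 5-joint `PARENTS` table and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical parent index for each joint of a toy 5-joint chain skeleton.
# Joint 0 is the root and is listed as its own parent. Real datasets such as
# UAV-Human define their own joint layout.
PARENTS = np.array([0, 0, 1, 2, 3])


def bone(joints, parents=PARENTS):
    """Bone modality: each joint minus its parent joint.

    joints has shape (T, V, C): frames, joints, coordinates (2D or 3D).
    The root bone is all zeros because the root is its own parent.
    """
    return joints - joints[:, parents, :]


def motion(x):
    """Motion modality: x[t + 1] - x[t], with the last frame left at zero."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]
    return m


def build_modalities(joints):
    """Return the four streams J, B, JM, BM from a joint sequence."""
    b = bone(joints)
    return {"J": joints, "B": b, "JM": motion(joints), "BM": motion(b)}


# Example: 4 frames, 5 joints, 3D coordinates.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(4, 5, 3))
streams = build_modalities(sequence)
```

Every stream keeps the input shape `(T, V, C)`, so the same backbone can consume any of them and only the input data changes per stream.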
Abstract
Skeleton-based action recognition has gained considerable traction thanks to its succinct and robust skeletal representations. Nonetheless, current methods often rely on a single backbone to model the skeleton modality, which can be limited by inherent flaws in that backbone. To address this and fully leverage the complementary characteristics of different network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful modeling capabilities of Transformers for global information. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches use GCNs and Transformers, respectively, to model both 2D and 3D skeletal modalities. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset and outperforming most existing methods. Our code will be publicly available at: https://github.com/liujf69/ICMEW2024-Track10.
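To make the dual-branch idea concrete, here is a minimal sketch of score-level late fusion across branches. It is an illustration under stated assumptions rather than the released implementation: it assumes each branch outputs per-class logits, that scores are softmax-normalized before a weighted sum, and that weights default to 1.0. The stream names, the weights, and the `late_fusion` helper are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def late_fusion(stream_logits, weights=None):
    """Fuse per-stream class scores by a weighted sum of softmax outputs.

    stream_logits: dict mapping a stream name to an (N, num_classes) array.
    weights: optional dict mapping stream name to a float (assumed 1.0 each).
    Returns (predicted class per sample, fused score matrix).
    """
    names = sorted(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    fused = sum(weights[name] * softmax(stream_logits[name]) for name in names)
    return fused.argmax(axis=-1), fused


# Toy logits for two samples and three classes from two hypothetical streams.
gcn_logits = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
former_logits = np.array([[0.0, 0.0, 1.0], [0.0, 3.0, 0.0]])

pred, fused = late_fusion({"mixgcn": gcn_logits, "mixformer": former_logits})
```

Because fusion happens only on output scores, each branch and modality can be trained independently, and the fusion weights can be tuned afterwards on a validation split.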
