Flexible graph convolutional network for 3D human pose estimation
Abu Taib Mohammed Shahjahan, A. Ben Hamza
TL;DR
Standard GCNs for 3D human pose estimation aggregate only one-hop neighbors, which limits their ability to resolve depth ambiguity and occlusion. Flex-GCN introduces a flexible graph convolution that aggregates 1- and 2-hop information via the propagation operator $\mathbf{P} = ((1-s)\mathbf{I} + s \hat{\mathbf{A}}) \hat{\mathbf{A}} = (1-s)\hat{\mathbf{A}} + s \hat{\mathbf{A}}^2$, where the scalar $s$ balances the one-hop and two-hop terms. The convolution is augmented with an initial residual path and a learnable adjacency modulation $\check{\mathbf{A}} = \hat{\mathbf{A}} + \mathbf{Q}$, and the layers are arranged in a ConvNeXt-inspired residual architecture that includes a Global Response Normalization layer. Training is supervised with a loss $\mathcal{L}$ that combines $L_2$ and $L_1$ penalties on the 3D pose predictions. The model maintains the same time and memory complexity as standard GCNs while enabling richer, globally informed representations through explicit multi-hop propagation and learned long-range skeletal relationships. It achieves competitive results on Human3.6M and MPI-INF-3DHP, with ablations confirming the positive impact of the residual connection and symmetric modulation. These findings suggest a scalable approach for robust 3D pose estimation under occlusion and across datasets, with potential applicability to other graph-based vision tasks.
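To make the propagation operator concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes that $\hat{\mathbf{A}}$ is the symmetrically normalized adjacency with self-loops (the summary does not define it), and it uses a hypothetical 4-joint chain, random features, and an arbitrary value $s = 0.3$. The helper names `normalized_adjacency` and `flex_propagate` are illustrative only. Learnable weights, the initial residual path, and the modulation matrix $\mathbf{Q}$ are omitted.

```python
import numpy as np

def normalized_adjacency(A):
    """Assumed definition of A_hat: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def flex_propagate(A_hat, X, s):
    """Apply ((1 - s) I + s A_hat) A_hat to X without forming A_hat squared."""
    H = A_hat @ X                      # one-hop aggregation
    return (1.0 - s) * H + s * (A_hat @ H)  # blend with two-hop aggregation

# Hypothetical toy skeleton: a chain of 4 joints (0-1-2-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = normalized_adjacency(A)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))        # 4 joints, 3 feature channels
s = 0.3                                # arbitrary illustrative value

# Explicit operator from the expanded form (1 - s) A_hat + s A_hat^2.
P = (1.0 - s) * A_hat + s * (A_hat @ A_hat)
out = flex_propagate(A_hat, X, s)
```

In this sketch, `out` matches `P @ X`, and `P[0, 2]` is nonzero even though joints 0 and 2 are not adjacent, which is the two-hop reach described above. Applying $\hat{\mathbf{A}}$ twice to the features instead of materializing $\hat{\mathbf{A}}^2$ is one plausible way to keep the cost in the same order as a standard graph convolution.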
Abstract
Although graph convolutional networks exhibit promising performance in 3D human pose estimation, their reliance on one-hop neighbors limits their ability to capture high-order dependencies among body joints, crucial for mitigating uncertainty arising from occlusion or depth ambiguity. To tackle this limitation, we introduce Flex-GCN, a flexible graph convolutional network designed to learn graph representations that capture broader global information and dependencies. At its core is the flexible graph convolution, which aggregates features from both immediate and second-order neighbors of each node, while maintaining the same time and memory complexity as the standard convolution. Our network architecture comprises residual blocks of flexible graph convolutional layers, as well as a global response normalization layer for global feature aggregation, normalization and calibration. Quantitative and qualitative results demonstrate the effectiveness of our model, achieving competitive performance on benchmark datasets.
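As an illustration of the global response normalization step, the sketch below adapts the ConvNeXt V2 formulation of GRN to a joints-by-channels feature matrix. Treating the joint axis as the "spatial" axis is an assumption made here for illustration, and the exact form used in Flex-GCN may differ. The function name `global_response_norm`, the parameters `gamma` and `beta`, and the toy input are hypothetical.

```python
import numpy as np

def global_response_norm(X, gamma, beta, eps=1e-6):
    """GRN in the style of ConvNeXt V2, applied to X of shape (joints, channels).

    Assumption: the L2 norm is taken over the joint axis.
    """
    # Global aggregation: one L2 norm per channel, computed across all joints.
    G = np.linalg.norm(X, axis=0, keepdims=True)          # shape (1, channels)
    # Normalization: each channel's norm relative to the mean channel norm.
    N = G / (G.mean(axis=-1, keepdims=True) + eps)        # shape (1, channels)
    # Calibration with learnable scale and shift, plus a skip connection.
    return gamma * (X * N) + beta + X

# Toy input: 2 joints, 2 channels; channel 0 has norm 5, channel 1 has norm 0.
X = np.array([[3.0, 0.0],
              [4.0, 0.0]])

identity_out = global_response_norm(X, gamma=np.zeros(2), beta=np.zeros(2))
scaled_out = global_response_norm(X, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma` and `beta` set to zero the layer reduces to the identity, so `identity_out` equals `X`. With `gamma` set to one, the channel with the larger global norm is amplified relative to the others.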
