Flexible graph convolutional network for 3D human pose estimation

Abu Taib Mohammed Shahjahan, A. Ben Hamza

TL;DR

Standard GCNs for 3D human pose estimation aggregate only one-hop neighbors, which limits their ability to resolve depth ambiguity and occlusion. Flex-GCN introduces a flexible graph convolution that aggregates 1- and 2-hop information via the propagation operator $P = ((1-s)\mathbf{I} + s \hat{\mathbf{A}}) \hat{\mathbf{A}} = (1-s)\hat{\mathbf{A}} + s \hat{\mathbf{A}}^2$, augmented with an initial residual path and a learnable adjacency modulation $\check{\mathbf{A}} = \hat{\mathbf{A}} + \mathbf{Q}$, all within a ConvNeXt-inspired residual architecture with a Global Response Normalization layer. The model keeps the same time and memory complexity as a standard GCN while learning richer, globally informed representations. It achieves competitive results on Human3.6M and MPI-INF-3DHP, and ablations confirm the positive impact of the residual connection and symmetric modulation. The training loss $\mathcal{L}$ combines $L_2$ and $L_1$ penalties to supervise the 3D pose predictions, and the method benefits from explicit multi-hop information propagation and learned long-range skeletal relationships. These findings suggest a scalable approach to robust 3D pose estimation under occlusion and across datasets, with potential applicability to other graph-based vision tasks.
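To make the propagation operator concrete, the following minimal sketch applies $P = (1-s)\hat{\mathbf{A}} + s\hat{\mathbf{A}}^2$ to a toy feature matrix. It is not the authors' implementation. The 5-joint chain skeleton, the value of `s`, the helper names, and the symmetric normalization $\hat{\mathbf{A}} = \mathbf{D}^{-1/2}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-1/2}$ are all assumptions made for illustration. The sketch uses only the Python standard library so that it runs without extra dependencies.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph.

    This normalization is an assumption; the summary does not define A_hat.
    """
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def flexible_propagate(a_hat, x, s):
    """Compute ((1 - s) * A_hat + s * A_hat^2) @ x without forming A_hat^2."""
    one_hop = matmul(a_hat, x)        # A_hat X
    two_hop = matmul(a_hat, one_hop)  # A_hat (A_hat X)
    return [[(1 - s) * u + s * v for u, v in zip(r1, r2)]
            for r1, r2 in zip(one_hop, two_hop)]

# Hypothetical 5-joint chain with 2 feature channels per joint.
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
a_hat = normalized_adjacency(edges, n)
x = [[float(i), float(i * i)] for i in range(n)]
s = 0.3  # illustrative mixing weight between 1-hop and 2-hop terms

# Reference: build P explicitly and compare with the two-step product.
a_sq = matmul(a_hat, a_hat)
p = [[(1 - s) * a_hat[i][j] + s * a_sq[i][j] for j in range(n)] for i in range(n)]
explicit = matmul(p, x)
fast = flexible_propagate(a_hat, x, s)
max_diff = max(abs(u - v) for r1, r2 in zip(explicit, fast) for u, v in zip(r1, r2))
```

In this toy graph, joint 0 and joint 2 are not adjacent, so `a_hat[0][2]` is zero while `p[0][2]` is positive: the operator reaches second-order neighbors. Computing $\hat{\mathbf{A}}(\hat{\mathbf{A}}\mathbf{X})$ as two successive products avoids materializing $\hat{\mathbf{A}}^2$, which is one plausible reading of how the cost can stay on the order of a standard graph convolution.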

Abstract

Although graph convolutional networks exhibit promising performance in 3D human pose estimation, their reliance on one-hop neighbors limits their ability to capture high-order dependencies among body joints, crucial for mitigating uncertainty arising from occlusion or depth ambiguity. To tackle this limitation, we introduce Flex-GCN, a flexible graph convolutional network designed to learn graph representations that capture broader global information and dependencies. At its core is the flexible graph convolution, which aggregates features from both immediate and second-order neighbors of each node, while maintaining the same time and memory complexity as the standard convolution. Our network architecture comprises residual blocks of flexible graph convolutional layers, as well as a global response normalization layer for global feature aggregation, normalization and calibration. Quantitative and qualitative results demonstrate the effectiveness of our model, achieving competitive performance on benchmark datasets.
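The global response normalization layer mentioned above can be sketched as follows. The three steps (aggregation, normalization, calibration) follow the GRN formulation from ConvNeXt V2. Applying it to a joints-by-channels feature matrix, with the norm taken over joints, is an assumption here, as are the function name and the parameters `gamma`, `beta`, and `eps`. Treat this as an illustration, not the authors' layer.

```python
import math

def global_response_norm(x, gamma, beta, eps=1e-6):
    """GRN over a (joints x channels) feature matrix stored as a list of rows.

    Assumed adaptation of ConvNeXt V2's GRN: the spatial axis is the joint axis.
    """
    n_channels = len(x[0])
    # Aggregation: per-channel L2 norm across all joints.
    g = [math.sqrt(sum(row[c] ** 2 for row in x)) for c in range(n_channels)]
    # Normalization: divide each channel norm by the mean norm over channels.
    mean_g = sum(g) / n_channels
    n = [gc / (mean_g + eps) for gc in g]
    # Calibration: rescale the input per channel, then add a residual connection.
    return [[gamma[c] * row[c] * n[c] + beta[c] + row[c] for c in range(n_channels)]
            for row in x]

# Hypothetical features: 2 joints, 2 channels.
features = [[3.0, 1.0], [4.0, 0.0]]
identity_out = global_response_norm(features, gamma=[0.0, 0.0], beta=[0.0, 0.0])
scaled_out = global_response_norm(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` set to zero the layer reduces to the identity because of the residual term. With `gamma` set to one, channels whose norm is above the cross-channel mean are amplified and the others are damped.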

Paper Structure (10 sections, 2 theorems, 6 equations, 3 figures, 5 tables)

Key Result

Lemma 1

If two matrices $\bm{M}_{1}$ and $\bm{M}_{2}$ commute, i.e., $\bm{M}_{1}\bm{M}_{2}=\bm{M}_{2}\bm{M}_{1}$, then where $\rho(\cdot)$ denotes matrix spectral radius (i.e., largest absolute value of all eigenvalues).

Figures (3)

  • Figure 1: Network architecture of Flex-GCN for 3D human pose estimation.
  • Figure 2: Visual comparison between Flex-GCN and Modulated GCN on sample actions from the Human3.6M dataset.
  • Figure 3: Performance of our proposed Flex-GCN model on the Human3.6M dataset using varying batch and filter sizes.

Theorems & Definitions (2)

  • Lemma 1
  • Proposition 1