
Quater-GCN: Enhancing 3D Human Pose Estimation with Orientation and Semi-supervised Training

Xingyu Song, Zhan Li, Shi Chen, Kazuyuki Demachi

TL;DR

Quater-GCN (Q-GCN) introduces a directed graph convolutional framework that jointly models 2D joint coordinates and 2D bone rotations to produce 3D joint positions and 3D bone orientations in quaternion form, enabling richer pose representations. A novel graph configuration with incidence-based hierarchical relations and a four-branch sampling scheme provides adaptive spatial-temporal features, while a semi-supervised training strategy leverages unlabeled data by projecting the predicted 4D (quaternion) orientations into 2D rotations, which then supervise orientation regression. Empirical results across Human3.6M, HumanEva-I, and H3WB demonstrate state-of-the-art accuracy in both coordinate and orientation estimates, with ablations confirming the importance of orientation modeling, directed graphs, and semi-supervision. The approach has broad implications for animation, HCI, and safety-critical applications by delivering more precise and physically coherent 3D human poses from 2D inputs, even when orientation annotations are scarce. The model applies $T$-step temporal processing to skeletons with $N$ joints and $B$ bones, regressing a quaternion $q_b \in \mathbb{R}^4$ for each bone, and performs robustly with both ground-truth (GT) and detected 2D poses.
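To make the idea of an incidence-based directed graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical five-joint toy skeleton rooted at the pelvis, with each bone stored as a directed edge pointing away from the root; the names `JOINTS`, `BONES`, and `incidence_matrices` are illustrative only.

```python
# Hypothetical toy skeleton; real datasets use more joints.
JOINTS = ["pelvis", "spine", "head", "l_hip", "r_hip"]
# Each bone is a directed edge (parent, child) pointing away from the root.
BONES = [(0, 1), (1, 2), (0, 3), (0, 4)]


def incidence_matrices(num_joints, bones):
    """Return (source, target) incidence matrices of shape N x B.

    source[n][b] == 1 if joint n is the parent (tail) of bone b.
    target[n][b] == 1 if joint n is the child (head) of bone b.
    """
    source = [[0] * len(bones) for _ in range(num_joints)]
    target = [[0] * len(bones) for _ in range(num_joints)]
    for b, (parent, child) in enumerate(bones):
        source[parent][b] = 1
        target[child][b] = 1
    return source, target


source, target = incidence_matrices(len(JOINTS), BONES)
```

In this toy graph the root has outgoing bones but no incoming one, and every bone has exactly one child joint, which is what gives the graph its hierarchical direction.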

Abstract

3D human pose estimation is a vital task in computer vision, involving the prediction of human joint positions from images or videos to reconstruct a human skeleton in three-dimensional space. This technology is pivotal in various fields, including animation, security, human-computer interaction, and automotive safety, where it promotes both technological progress and enhanced human well-being. The advent of deep learning has significantly advanced the performance of 3D pose estimation by incorporating temporal information when predicting the spatial positions of human joints. However, traditional methods often fall short because they focus primarily on the spatial coordinates of joints and overlook the orientation and rotation of the connecting bones, which are crucial for a comprehensive understanding of human pose in 3D space. To address these limitations, we introduce Quater-GCN (Q-GCN), a directed graph convolutional network tailored to enhance pose estimation with orientation information. Q-GCN not only captures the spatial dependencies among node joints through their coordinates but also integrates the dynamic context of bone rotations in 2D space. This approach enables a more sophisticated representation of human poses by also regressing the orientation of each bone in 3D space, moving beyond mere coordinate prediction. Furthermore, we complement our model with a semi-supervised training strategy that leverages unlabeled data, addressing the challenge of limited orientation ground-truth data. In comprehensive evaluations, Q-GCN demonstrates outstanding performance against current state-of-the-art methods.
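As a rough illustration of what "bone rotations in 2D space" could look like as network input, here is a small sketch under stated assumptions. It assumes that a bone's 2D rotation is the angle of the parent-to-child vector measured from the positive $x$-axis of the image plane; the paper may define it differently, and `bone_rotations_2d` is a hypothetical helper name.

```python
import math


def bone_rotations_2d(joints_2d, bones):
    """Angle (radians) of each bone in the image plane.

    joints_2d: list of (x, y) joint coordinates for one frame.
    bones: list of (parent, child) index pairs.
    Assumption: a bone's 2D rotation is the angle of the vector
    parent -> child measured from the positive x-axis.
    """
    angles = []
    for parent, child in bones:
        dx = joints_2d[child][0] - joints_2d[parent][0]
        dy = joints_2d[child][1] - joints_2d[parent][1]
        angles.append(math.atan2(dy, dx))
    return angles


# Toy frame: pelvis at the origin, spine above it, one hip to the right.
frame = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
bones = [(0, 1), (0, 2)]
rotations = bone_rotations_2d(frame, bones)
```

The joint coordinates in `frame` would play the role of vertex features and the angles in `rotations` the role of edge features in a directed graph.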

Paper Structure

This paper contains 24 sections, 18 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overall architecture of Q-GCN. Q-GCN begins by dividing the input 2D pose sequence into node coordinates and bone rotations, which are represented as vertices and edges in a directed graph, respectively. It then extracts spatial-temporal features from these vertices and edges, incorporating a residual connection with each convolution operation. Following this feature extraction, Q-GCN reconstructs the human 3D pose and 4D orientation using a fully-connected (FC) layer that includes a Squeeze-and-Excitation (SE) block.
  • Figure 2: Overall configuration of the directed graph and the sampling strategy in Q-GCN. In this graph, vertices and edges are organized in a hierarchical structure, with the root node (typically the pelvis node) serving as the initialization point. The sampling strategy is illustrated by marking the target vertex and the target edge with a dotted circle, while the colors of the joints indicate the subsets they belong to, as defined by the sampling strategy.
  • Figure 3: Semi-supervised training strategy for orientation regression. The unlabeled 2D rotations in the latter half of the batch are regressed using projected 2D rotations. These projections are derived from combining the predicted 4D orientations with the predicted 3D positions, both of which are initially trained using labeled data during the first half of the batch.
  • Figure 4: Schematic diagram of Euler angle calculation: $\boldsymbol{p}_{shoulder}$ (rotation center), $\boldsymbol{p}_{elbow}$, and $\boldsymbol{p}_{wrist}$ represent the nodes corresponding to the shoulder, elbow, and wrist joints in the 3D world coordinate system, respectively. Given the initial target vector $\vec{v}_{init}$ and the initial reference vector $\vec{r}_{init}$, rotating them sequentially around the $x$, $y$, and $z$ axes by angles $\alpha$, $\beta$, and $\gamma$ respectively yields $(\vec{v}_x,\vec{r}_x)$, $(\vec{v}_y,\vec{r}_y)$, and finally $(\vec{v},\vec{r})$. The gradient arrow in the diagram indicates the order in which the Euler angles are calculated, which is the reverse of the rotation sequence. Consequently, $\gamma$ is calculated as the rotation angle from the positive $x$-axis to $\vec{v}_{xOy}$, the projection of vector $\vec{v}$ onto the $xOy$ plane. Likewise, $\beta$ is the angle between vector $\vec{v}$ and the $xOy$ plane. Lastly, $\alpha$ is calculated as the angle from the positive $y$-axis to $\vec{r}_{x_{yOz}}$, the projection of $\vec{r}_{x}$ onto the $yOz$ plane.
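The Euler angle recovery described in the Figure 4 caption can be sketched as follows. This is an illustrative reconstruction under several assumptions that are not stated in the source: the initial target vector points along $+x$, the initial reference vector points along $+y$, rotations use standard right-handed matrices applied about fixed axes in the order $x$, $y$, $z$, and $\beta$ stays within $(-\pi/2, \pi/2)$. With right-handed matrices a positive $\beta$ tilts $+x$ toward $-z$, hence the sign in the code. The final conversion to a quaternion uses the standard formula for this rotation order and is likewise an assumption about how the angles relate to the 4D orientation. All function names are hypothetical.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def apply(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def rotate_xyz(vec, alpha, beta, gamma):
    """Rotate about x, then y, then z (fixed axes)."""
    return apply(rot_z(gamma), apply(rot_y(beta), apply(rot_x(alpha), vec)))


def euler_from_vectors(v, r):
    """Recover (alpha, beta, gamma) in reverse order of the rotations."""
    # gamma: angle from +x to the projection of v onto the xOy plane.
    gamma = math.atan2(v[1], v[0])
    # beta: angle between v and the xOy plane (sign set by rot_y convention).
    beta = math.atan2(-v[2], math.hypot(v[0], v[1]))
    # Undo the z and y rotations to get r_x, which lies in the yOz plane.
    r_x = apply(transpose(rot_y(beta)), apply(transpose(rot_z(gamma)), r))
    # alpha: angle from +y to r_x within the yOz plane.
    alpha = math.atan2(r_x[2], r_x[1])
    return alpha, beta, gamma


def euler_to_quaternion(alpha, beta, gamma):
    """Unit quaternion (w, x, y, z) for R = Rz(gamma) Ry(beta) Rx(alpha)."""
    cr, sr = math.cos(alpha / 2), math.sin(alpha / 2)
    cp, sp = math.cos(beta / 2), math.sin(beta / 2)
    cy, sy = math.cos(gamma / 2), math.sin(gamma / 2)
    return (
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    )


# Round trip: build v and r from known angles, then recover the angles.
v = rotate_xyz([1.0, 0.0, 0.0], 0.3, -0.5, 1.2)
r = rotate_xyz([0.0, 1.0, 0.0], 0.3, -0.5, 1.2)
alpha, beta, gamma = euler_from_vectors(v, r)
quat = euler_to_quaternion(alpha, beta, gamma)
```

The reference vector is needed because the target vector alone is unchanged by the rotation about its own initial axis, so $\alpha$ cannot be read from $\vec{v}$.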