Occlusion Resilient 3D Human Pose Estimation

Soumava Kumar Roy, Ilia Badanin, Sina Honari, Pascal Fua

TL;DR

This work represents the deforming body as a spatio-temporal graph and introduces a refinement network that performs graph convolutions over this graph to output 3D poses. During training, it simulates the fact that some joints can be hidden for periods of time and trains the network to be immune to such occlusions.

Abstract

Occlusions remain one of the key challenges in 3D body pose estimation from single-camera video sequences. Temporal consistency has been extensively used to mitigate their impact, but existing algorithms in the literature do not explicitly model them. Here, we model them explicitly by representing the deforming body as a spatio-temporal graph. We then introduce a refinement network that performs graph convolutions over this graph to output 3D poses. To ensure robustness to occlusions, we train this network with a set of binary masks that we use to disable some of the edges, as in drop-out techniques. In effect, we simulate the fact that some joints can be hidden for periods of time and train the network to be immune to that. We demonstrate the effectiveness of this approach compared to state-of-the-art techniques that infer poses from single-camera sequences.
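
The edge-masking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy four-joint skeleton, the function names (`build_edges`, `occlusion_mask`, `apply_mask`, `random_occlusion`), and the rule that hiding a joint disables every edge touching it during the hidden frames are all assumptions made for illustration.

```python
import random

# Hypothetical toy skeleton: 4 joints in a chain (not the paper's skeleton).
BONES = [(0, 1), (1, 2), (2, 3)]

def build_edges(num_frames, bones, num_joints):
    """Edges between nodes (t, j): spatial within a frame, temporal across frames."""
    edges = []
    for t in range(num_frames):
        for a, b in bones:
            edges.append(((t, a), (t, b)))          # spatial edge
        if t + 1 < num_frames:
            for j in range(num_joints):
                edges.append(((t, j), (t + 1, j)))  # temporal edge to next frame
    return edges

def occlusion_mask(edges, joint, start, length):
    """Binary mask: 0 for edges touching `joint` in frames [start, start + length)."""
    hidden = {(t, joint) for t in range(start, start + length)}
    return [0 if (u in hidden or v in hidden) else 1 for u, v in edges]

def apply_mask(edges, mask):
    """Keep only the edges whose mask entry is 1."""
    return [e for e, m in zip(edges, mask) if m == 1]

def random_occlusion(edges, num_frames, num_joints, max_len, rng):
    """Assumed sampling scheme: hide one random joint for a random time span."""
    joint = rng.randrange(num_joints)
    length = rng.randint(1, max_len)                # requires max_len <= num_frames
    start = rng.randrange(num_frames - length + 1)
    return occlusion_mask(edges, joint, start, length)

edges = build_edges(num_frames=5, bones=BONES, num_joints=4)
mask = occlusion_mask(edges, joint=1, start=1, length=2)
masked_edges = apply_mask(edges, mask)
```

In a training loop, a fresh mask from `random_occlusion` could be drawn for each sample, which is what makes the scheme resemble drop-out applied to edges rather than to activations.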

Paper Structure (31 sections, 3 equations, 7 figures, 10 tables)

This paper contains 31 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Graph masking. $\mathcal{G}$ is our spatio-temporal graph. The solid colors denote graph nodes corresponding to the same joint over time. The gray edges are spatial edges connecting joints seen at the same time, while the colored edges are temporal and connect nodes corresponding to the same joint over time. For clarity, we only show spatial and temporal connections for 11 joints. Along the temporal domain, each joint is connected to its temporal neighbor (i.e., $\Delta = 1$). As an example, we also show temporal connections with $\Delta = 2$ for two joints (head and left foot). $\mathcal{M}$ represents the set of binary masks that are used to deactivate some of these edges to create the masked graph $\bar{\mathcal{G}}$, which is then fed to the refinement network. Refer to Section \ref{sec:ref_net} for more details.
  • Figure 2: Our approach. 3D joint coordinates are extracted from individual images by the lifting net $LNet_{{\boldsymbol{\Phi}}}$ and become the nodes of a spatio-temporal graph $\mathcal{G}$. Some of its edges are masked to produce a reduced graph $\bar{\mathcal{G}}$. It is fed to a refinement network $RNet_{{\boldsymbol{\Theta}}}$ that returns the pose in the selected target frame $t_p$. The masking operation is depicted in Fig. \ref{fig:spatial-temporal-graph-masks} (refer to Sections \ref{sec:lift_net} and \ref{sec:ref_net} for more details).
  • Figure 3: Architecture of the refinement network. $RNet_{{\boldsymbol{\Theta}}}$ is a set of GCNNs that operate on the masked graph $\bar{\mathcal{G}}$. Each relationship-specific GCN is trained on a different set of connections between the joints; their embeddings are eventually fused and processed by the embedding-fusion network with parameters ${\boldsymbol{\Theta}}^f$.
  • Figure 4: SportCenter. Qualitative results on samples from the "Hard" test set for (a) Iskakov et al. [Iskakov19], (b) Roy et al. [roy22a], and (c) Ours.
  • Figure 5: Comparative study on the SportCenter dataset in the semi-supervised setup.
  • ...and 2 more figures
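
The relationship-specific processing described in the Figure 1 and Figure 3 captions can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: the helper names (`spatial_edges`, `temporal_edges`, `aggregate`, `fuse`) are hypothetical, the graph-convolution step is a parameter-free neighbour average with no learned weights, and the embedding-fusion network is replaced by an element-wise mean.

```python
def spatial_edges(num_frames, bones):
    """Edges linking different joints within the same frame."""
    return [((t, a), (t, b)) for t in range(num_frames) for a, b in bones]

def temporal_edges(num_frames, num_joints, delta):
    """Edges linking the same joint in frames t and t + delta."""
    return [((t, j), (t + delta, j))
            for t in range(num_frames - delta)
            for j in range(num_joints)]

def aggregate(features, edges):
    """One weight-free graph-convolution step: average each node with its neighbours."""
    neighbours = {node: [node] for node in features}  # include a self-loop
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return {
        node: tuple(sum(features[n][d] for n in nbrs) / len(nbrs)
                    for d in range(len(features[node])))
        for node, nbrs in neighbours.items()
    }

def fuse(embeddings):
    """Stand-in for the embedding-fusion network: element-wise mean over relations."""
    k = len(embeddings)
    return {
        node: tuple(sum(e[node][d] for e in embeddings) / k
                    for d in range(len(embeddings[0][node])))
        for node in embeddings[0]
    }

# Toy input: 3 frames, 2 joints, one bone; node (t, j) carries a 3D coordinate.
num_frames, num_joints, bones = 3, 2, [(0, 1)]
features = {(t, j): (float(t), float(j), 0.0)
            for t in range(num_frames) for j in range(num_joints)}

relations = [
    spatial_edges(num_frames, bones),
    temporal_edges(num_frames, num_joints, delta=1),
    temporal_edges(num_frames, num_joints, delta=2),
]
fused = fuse([aggregate(features, rel) for rel in relations])
```

Keeping one edge set per relation means that a binary mask can deactivate edges in one relation (for example, temporal links with $\Delta = 1$) without touching the others.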