Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis

Dominik Borer, Jakob Buhmann, Martin Guay

TL;DR

This work trains a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips and outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

Abstract

Modern pose estimation models are trained on large, manually labelled datasets, which are costly and may not cover the full extent of human poses and appearances in the real world. With advances in neural rendering, analysis-by-synthesis, with its ability to not only predict but also render the pose, is becoming an appealing framework that could alleviate the need for large-scale manual labelling efforts. While recent work has shown the feasibility of this approach, the predictions exhibit many flips due to a simplistic intermediate skeleton representation, resulting in low precision and inhibiting the acquisition of any downstream knowledge such as three-dimensional positioning. We solve this problem with a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips. To successfully train this new representation, we extend the analysis-by-synthesis framework with a training protocol based on synthetic data. We show that our representation results in fewer flips and more accurate predictions. Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

Paper Structure

This paper contains 25 sections, 5 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of the model and training procedure. The model has five main components (orange). First, the input image $x$ is mapped to a skeleton image representation $y$. From this, the 2D pose $p_{2D}$ is estimated and then lifted to 3D joint positions and orientations $p_{3D}$. The reprojected coordinates are used to analytically create a skeleton image $\hat{y}$, from which the input image is reconstructed as $\hat{x}$. To train the model, we use a mixture of synthetic and real data to optimize several objectives (green).
  • Figure 2: Samples from the training dataset. Top: Synthetically generated data. Bottom: Unlabelled, real, in-the-wild videos. The synthetic data contains a lot of variation in pose, appearance and background and the real data covers a variety of different people performing various motions.
  • Figure 3: The single-channel skeleton image representation of Jakab et al. (CVPR 2020) suffers from ambiguities and fails to capture the body part semantics, causing flips in the predicted pose (e.g., red and blue should be the right arm and leg).
  • Figure 4: Our multi-channel skeleton image representation. Each channel (visualized with different colors) represents a semantically meaningful set of joints.
  • Figure 5: Predicted poses when using our multi-channel skeleton image. The predictions are accurate for a wide range of poses and do not suffer from left/right flips.
  • ...and 4 more figures
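
The captions for Figures 1 and 4 describe a skeleton image that is created analytically from 2D joint coordinates, with each channel holding a semantically meaningful set of joints. The following is a minimal sketch of that idea, not the authors' implementation: the joint names, the grouping of bones into channels, the Gaussian falloff and the `sigma` parameter are all assumptions chosen for illustration. It shows why separate channels can encode left and right: swapping the two sides changes the rendered image, which would not happen if all bones were drawn into one channel.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment from a to b."""
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    if denom == 0.0:
        t = 0.0  # degenerate bone: both endpoints coincide
    else:
        # Project the point onto the bone and clamp to the segment.
        t = ((px - ax) * abx + (py - ay) * aby) / denom
        t = max(0.0, min(1.0, t))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)


def render_skeleton_image(joints, channels, height, width, sigma=1.0):
    """Render one image plane per channel.

    joints:   dict mapping a joint name to (x, y) pixel coordinates
    channels: list of channels, each a list of (joint_a, joint_b) bones
    Returns a list of planes; each plane is a height x width list of floats
    in [0, 1], with 1.0 exactly on a bone (assumed Gaussian falloff).
    """
    image = []
    for bones in channels:
        plane = [[0.0] * width for _ in range(height)]
        for row in range(height):
            for col in range(width):
                for a, b in bones:
                    ax, ay = joints[a]
                    bx, by = joints[b]
                    d = point_segment_distance(col, row, ax, ay, bx, by)
                    value = math.exp(-(d * d) / (2.0 * sigma * sigma))
                    # Keep the strongest response among bones in this channel.
                    if value > plane[row][col]:
                        plane[row][col] = value
        image.append(plane)
    return image


# Hypothetical toy skeleton: one bone per side, one channel per side.
joints = {
    "l_shoulder": (4.0, 4.0), "l_hand": (4.0, 12.0),
    "r_shoulder": (12.0, 4.0), "r_hand": (12.0, 12.0),
}
channels = [
    [("l_shoulder", "l_hand")],  # channel 0: left arm
    [("r_shoulder", "r_hand")],  # channel 1: right arm
]
image = render_skeleton_image(joints, channels, height=16, width=16)

# The same pose with left and right swapped.
flipped_joints = {
    "l_shoulder": joints["r_shoulder"], "l_hand": joints["r_hand"],
    "r_shoulder": joints["l_shoulder"], "r_hand": joints["l_hand"],
}
flipped = render_skeleton_image(flipped_joints, channels, height=16, width=16)
```

In this toy setup, `flipped` has the contents of the two channels exchanged relative to `image`, so a reconstruction loss computed on the multi-channel image can tell a flipped pose from the correct one. A single-channel rendering of the same two bones would be identical in both cases.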