V-VIPE: Variational View Invariant Pose Embedding

Mara Levy, Abhinav Shrivastava

TL;DR

This paper tackles viewpoint ambiguity in monocular 3D human pose estimation by learning a variational, view-invariant embedding of 3D poses (V-VIPE) in a canonical coordinate space. It decouples pose estimation into two parts: a 3D Pose VAE that yields a smooth embedding, and a 2D Mapping Network that maps 2D keypoints into that embedding, which a frozen copy of the VAE decoder then lifts to 3D. The approach supports 3D pose retrieval, generation, and monocular 3D reconstruction, and it generalizes to unseen camera viewpoints, as demonstrated on Human3.6M and MPI-3DHP with Hit@k and MPJPE metrics. Ablations show the importance of rotation realignment, triplet loss, and pretraining. The method provides a flexible, camera-agnostic representation that can support downstream tasks such as action recognition and robotics imitation by offering a consistent 3D pose space across views.
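The two metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and for Hit@k the rule that decides whether a gallery pose counts as a match for a query (for example, a 3D error threshold) is an assumption left to the caller via `is_match`.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error: the Euclidean distance between each
    predicted and ground-truth joint, averaged over all joints and poses.

    pred, target: arrays of shape (..., J, 3) in the same units (e.g. mm).
    """
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))

def hit_at_k(query_emb, gallery_emb, is_match, k):
    """Fraction of queries that have at least one matching pose among their
    k nearest gallery embeddings (Euclidean distance in embedding space).

    is_match[i, j] is True when gallery pose j counts as the same pose as
    query i. How a match is decided is an assumption made by the caller.
    """
    # Pairwise distances, shape (num_queries, num_gallery).
    diffs = query_emb[:, None, :] - gallery_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Indices of the k closest gallery items for each query.
    top_k = np.argsort(dists, axis=1)[:, :k]
    hits = np.take_along_axis(is_match, top_k, axis=1).any(axis=1)
    return float(hits.mean())
```

Lower MPJPE is better, while higher Hit@k is better; the first measures reconstruction accuracy and the second measures how well distances in the embedding reflect pose similarity.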

Abstract

Learning to represent three-dimensional (3D) human pose from a two-dimensional (2D) image of a person is a challenging problem. To make the problem less ambiguous, it has become common practice to estimate 3D pose in the camera coordinate space. However, this makes it difficult to compare two 3D poses. In this paper, we address this challenge by separating the problem of estimating 3D pose from 2D images into two steps. We use a variational autoencoder (VAE) to find an embedding that represents 3D poses in a canonical coordinate space. We refer to this embedding as the variational view-invariant pose embedding (V-VIPE). Using V-VIPE, we can encode 2D and 3D poses and use the embedding for downstream tasks such as retrieval and classification. We can estimate 3D poses from these embeddings using the decoder, as well as generate unseen 3D poses. The variability of our encoding allows it to generalize well to unseen camera views when mapping from 2D space. To the best of our knowledge, V-VIPE is the only representation to offer this diversity of applications. Code and more information can be found at https://v-vipe.github.io/.
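As background for the VAE step, the sketch below shows the two standard ingredients of a Gaussian VAE latent space: the reparameterized sample and the closed-form KL divergence to a standard normal prior. It is a generic illustration, not the paper's implementation; the function names and the latent size of 32 are assumptions made for the example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((1, 32))       # 32 is a hypothetical latent size
log_var = np.zeros((1, 32))  # log-variance of 0 means unit variance
z = reparameterize(mu, log_var, rng)
```

Sampling around an encoded mean in this way is one plausible route to the pose variations the abstract mentions: nearby latent codes are passed through the decoder to produce poses close to, but not identical to, the original.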

Paper Structure (15 sections, 2 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: The several functions V-VIPE is capable of. The purple path represents 3D pose retrieval. The blue path represents generation by adding noise to the purple path. The result is a variation of the original pose. The green path shows 2D to 3D pose estimation from several viewpoints.
  • Figure 2: On the left we can see the 3D pose in the original global coordinates with 4 different cameras. The next 4 images are the 3D poses as seen from these 4 cameras.
  • Figure 3: How poses change when we align the points and modify the rotation. On the left is the original pose and on the right is the pose after we have rotated it.
  • Figure 4: The network on top is our "3D Pose VAE Network." First we pass the 3D input through our data processing phase. Once we have the output we can pass that as input to our VAE network, which generates V-VIPE and then attempts to reconstruct the pose. On the bottom is our "2D Mapping Network." 2D keypoints are extracted using a detector. We then pass these through our 2D encoder and then a locked clone of the decoder network from the 3D Pose VAE Network. This reconstructs the original 3D pose.
  • Figure 5: Pose estimation from 2D images, with our model applied to different camera viewpoints. We show 4 sets of results. The ground truth is on the left-hand side of each example; on the right are the 4 original views and our model's 3D output for each view.
  • ...and 3 more figures
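
The alignment idea behind Figures 2 and 3 (the same pose looks different in each camera's coordinates until it is brought into a shared canonical frame) can be illustrated with a small sketch. The procedure here is an assumption for illustration and may differ from the paper's: it centers the pose on a root joint and rotates it about the vertical axis so that the hip line points along +x. The joint indices and the choice of z as the vertical axis are hypothetical, and the sketch only removes translation and rotation about that one axis.

```python
import numpy as np

ROOT, L_HIP, R_HIP = 0, 1, 2  # hypothetical joint indices

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def canonicalize(pose):
    """Center a (J, 3) pose on its root joint and rotate it about z so the
    right-to-left hip vector lies along +x (assumes z is the vertical axis)."""
    centered = pose - pose[ROOT]
    hip = centered[L_HIP] - centered[R_HIP]
    yaw = np.arctan2(hip[1], hip[0])
    # Rotate every joint by -yaw; joints are row vectors, hence the transpose.
    return centered @ rot_z(-yaw).T

# The same pose, shifted and turned about the vertical axis,
# maps to the same canonical pose.
pose = np.random.default_rng(1).standard_normal((5, 3))
turned = pose @ rot_z(0.7).T + np.array([1.0, 2.0, 3.0])
same = np.allclose(canonicalize(pose), canonicalize(turned))
```

Under these assumptions, two observations of a pose that differ only by position and heading collapse to identical coordinates, which is what makes direct comparison between poses meaningful.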