Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets

Fahd Alhamazani, Yu-Kun Lai, Paul L. Rosin

TL;DR

The paper addresses non-rigid 3D reconstruction from a single depth image by introducing a canonical pose reconstruction framework that maps an input depth image $X^d$ and its mask $X^m$ to a canonical-pose depth image $C^{d}$ and mask $C^{m}$, enabling subsequent 3D shape recovery. The architecture has two stages: Stage One combines a Local Feature Extractor (LFE), a Multi-Scale Feature Extractor (MSFE), and a Canonical Pose Depth Reconstruction module, and Stage Two is a pose-conditioned GAN with dual encoders that generates a voxelized shape $Y_{shape} \in \mathbb{R}^{256 \times 256 \times 256}$. The losses are staged accordingly: Stage One uses depth and mask losses to learn the canonical pose, while Stage Two combines a weighted BCE loss with a WGAN-GP adversarial objective to refine the surface and recover the full geometry. The approach is data-efficient (approximately $300$ samples) and outperforms state-of-the-art methods on synthetic and TOSCA datasets, with ablations confirming the contributions of the LFE, the MSFE, and the shape encoder to reconstruction quality.
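
To make the two Stage Two loss terms concrete, here is a minimal pure-Python sketch of a weighted BCE over voxel occupancies and of the WGAN-GP gradient-penalty term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the occupied-voxel weight `w_occ = 0.85`, and the penalty coefficient `lam = 10.0` are hypothetical choices, and the paper's own weighting scheme is not given in this summary.

```python
import math

def weighted_bce(pred, target, w_occ=0.85, eps=1e-7):
    """Mean weighted binary cross-entropy over flattened voxel occupancies.

    w_occ scales the occupied-voxel term and (1 - w_occ) the empty-voxel term.
    The default 0.85 is an assumed value for illustration only.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= w_occ * t * math.log(p) + (1.0 - w_occ) * (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def gradient_penalty(grad_norms, lam=10.0):
    """WGAN-GP term: lam * mean((||grad|| - 1)^2).

    grad_norms holds the L2 norms of the critic's gradients at interpolated
    samples; computing them needs an autodiff framework and is out of scope.
    lam = 10.0 is an assumed coefficient.
    """
    return lam * sum((g - 1.0) ** 2 for g in grad_norms) / len(grad_norms)

# With w_occ > 0.5, missing an occupied voxel costs more than a false positive.
missed_voxel = weighted_bce([0.1], [1.0])
false_positive = weighted_bce([0.9], [0.0])
print(missed_voxel > false_positive)
```

Weighting the occupied term more heavily is a common way to counter the imbalance between the few occupied voxels and the many empty ones in a $256^3$ grid; whether the paper uses this exact form is an assumption here.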

Abstract

3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges due to the significant range of possible deformations. Traditional methods often struggle with non-rigid shapes, which require extensive training data to cover the entire deformation space. This study addresses these limitations by proposing a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form. This alignment facilitates shape reconstruction by enabling the application of rigid object reconstruction techniques, and supports recovering the input pose in voxel representation as part of the reconstruction task, utilizing both the original and deformed depth images. Notably, our model achieves effective results with only a small dataset of approximately 300 samples. Experimental results on animal and human datasets demonstrate that our model outperforms other state-of-the-art methods.

Paper Structure

This paper contains 15 sections, 17 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Stage one, overview of the model, where the input is a depth image in any pose and the output is the canonicalised depth image.
  • Figure 2: The Local Feature Extractor (LFE) takes the single-view depth image $X^d$ and the corresponding mask $X^m$ as input and produces a local feature output of the same size as the input, denoted as $Y_{LFE}$
  • Figure 3: The model takes as input the original depth $X^d$, its mask $X^m$ where $[X^d, X^m]=X$, and the local feature output $Y_{LFE}$. It features three encoders, each having a distinct dilation rate, with each encoder made up of down-sample blocks. Following the encoders, the latent codes are concatenated and passed through a fuser for inter-mapping. The subsequent decoder consists of up-sample blocks, culminating in the reconstructed multi-scale features, denoted as $MSF$
  • Figure 4: The canonical reconstruction component leverages the original input $X$, the LFE output $LF$, and the MSFE output $MSF$. The model uses these inputs to determine the canonical form $C$ which consists of canonical form depth image $C^d$ and its mask $C^m$.
  • Figure 5: Stage two, in which both the original input $X^{d,m}$ and the estimated depth image $Y_{CR}^{d,m}$ are used to reconstruct the 3D shape ($Y_{shape}$).
  • ...and 4 more figures
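
The Figure 3 caption mentions three encoders with distinct dilation rates. As a rough illustration of why different rates capture different scales, the sketch below uses a 1D analogue in pure Python: the same 3-tap kernel covers a wider span of the input as the dilation grows. The rates `1, 2, 4`, the kernel, and the function names are hypothetical; the paper's actual rates and 2D layers are not given in this summary.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input samples spanned by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1

def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D cross-correlation with taps spaced `dilation` samples apart."""
    span = effective_kernel(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out

signal = list(range(10))   # a simple ramp 0..9
kernel = [1, 0, -1]        # difference between the first and last tap

for rate in (1, 2, 4):     # assumed dilation rates, one per encoder
    print(rate, effective_kernel(len(kernel), rate), dilated_conv1d(signal, kernel, rate))
```

On the ramp, the output magnitude grows with the rate because the first and last taps sit further apart, which is the 1D counterpart of each encoder seeing context at a different scale before the latent codes are concatenated and fused.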