Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets
Fahd Alhamazani, Yu-Kun Lai, Paul L. Rosin
TL;DR
The paper addresses non-rigid 3D reconstruction from a single depth image by introducing a canonical pose reconstruction framework that maps the input depth image $X^d$ and mask $X^m$ to a canonical-pose depth image $C^{d}$ and mask $C^{m}$, enabling subsequent 3D shape recovery. It presents a two-stage network architecture combining a Local Feature Extractor (LFE), a Multi-Scale Feature Extractor (MSFE), and a Canonical Pose Depth Reconstruction module, followed by a pose-conditioned GAN with dual encoders that generates a voxelized shape $Y_{shape} \in \mathbb{R}^{256 \times 256 \times 256}$. Losses are staged: Stage One uses depth and mask losses to learn the canonical pose, while Stage Two uses a weighted binary cross-entropy (BCE) loss and a WGAN-GP adversarial objective to refine the surface and recover the full geometry. The approach is data-efficient (approximately 300 samples) and outperforms state-of-the-art methods on synthetic and TOSCA datasets, with ablations confirming the contributions of the LFE, the MSFE, and the shape encoder to reconstruction quality.
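The staged losses can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the L1 form of the depth term, the unweighted BCE for the mask, the `w_pos`/`w_neg` values, and the function names `weighted_bce`, `stage_one_loss`, and `stage_two_recon_loss` are all assumptions made for illustration, and the WGAN-GP adversarial term of Stage Two is left out.

```python
import numpy as np


def weighted_bce(pred, target, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Binary cross-entropy with separate weights for positive and negative cells.

    pred and target are arrays of the same shape with values in [0, 1].
    w_pos and w_neg are hypothetical weights, not values from the paper.
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    loss = -(w_pos * target * np.log(pred)
             + w_neg * (1.0 - target) * np.log(1.0 - pred))
    return float(loss.mean())


def stage_one_loss(pred_depth, pred_mask, gt_depth, gt_mask, mask_weight=1.0):
    """Assumed Stage One objective: L1 depth error plus plain BCE on the mask."""
    depth_term = float(np.abs(pred_depth - gt_depth).mean())
    mask_term = weighted_bce(pred_mask, gt_mask)
    return depth_term + mask_weight * mask_term


def stage_two_recon_loss(pred_voxels, gt_voxels, w_pos=0.9, w_neg=0.1):
    """Assumed Stage Two reconstruction term: weighted BCE over voxel occupancy.

    Occupied voxels are usually a small fraction of the grid, so a missed
    occupied voxel is penalized more heavily than a false positive here.
    """
    return weighted_bce(pred_voxels, gt_voxels, w_pos=w_pos, w_neg=w_neg)
```

With `w_pos > w_neg`, a prediction that drops an occupied voxel scores worse than one that adds a spurious voxel, which is the usual motivation for weighting BCE on sparse occupancy grids.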
Abstract
3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges due to the wide range of possible deformations. Traditional methods often struggle with non-rigid shapes because covering the entire deformation space requires extensive training data. This study addresses these limitations by proposing a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form. This alignment facilitates shape reconstruction by enabling the application of rigid object reconstruction techniques, and it supports recovering the input pose in voxel representation as part of the reconstruction task, utilizing both the original and deformed depth images. Notably, our model achieves effective results with only a small dataset of approximately 300 samples. Experimental results on animal and human datasets demonstrate that our model outperforms other state-of-the-art methods.
