SAOR: Single-View Articulated Object Reconstruction

Mehmet Aygün, Oisin Mac Aodha

TL;DR

SAOR tackles single-view articulated object reconstruction without category-specific 3D templates or skeleton priors by adopting a skeleton-free, part-based deformation model trained with a cross-instance swap consistency loss and silhouette-based viewpoint sampling. The method predicts shape, texture, and camera pose from a single image in a single forward pass, using differentiable rendering to enforce self-supervised consistency across multiple categories. Key innovations include a part-based articulation mechanism with learned skinning weights and a streamlined swap loss that reduces degeneracy for articulated objects, enabling category-agnostic generalization to over 100 animal categories. The results demonstrate improved 2D keypoint transfer and 3D Chamfer metrics compared to non-3D-supervised baselines, with efficient inference suitable for practical use, though texture realism and extreme viewpoints remain challenging.

Abstract

We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.

Paper Structure (22 sections, 11 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: SAOR is capable of predicting the 3D shape of an articulated object category from a single image. Our model is trained on multiple categories simultaneously using self-supervision on single-view image collections. It can efficiently predict object pose, 3D shape reconstruction, and unsupervised part-level assignment using only a single forward pass per image at test time in a category-agnostic way.
  • Figure 2: Overview of the generation phase of our SAOR method. Given a single image $I$ as input, we extract a global feature vector $\phi_{im}$ which is decoded by four separate networks ($f_d$, $f_a$, $f_t$, and $f_p$) to generate a final output image $\hat{I}$. We start by deforming an initial sphere, then articulate it using a part-based linear blend skinning (LBS) operation $\xi$, texture the mesh, and render it using a differentiable renderer $\Pi$ so that it is depicted from the same viewpoint as the input image. The parameters of each of the networks presented are trained in an end-to-end manner using image reconstruction-based self-supervision from multiple different categories using the same model.
  • Figure 3: Illustration of our articulated swap loss. To calculate the loss, a swap image $\hat{I}_{i}^{sw}$ is rendered using a randomly chosen paired image's shape $S_{j}'$, combined with the estimated texture, viewpoint, and articulation ($T_{i}, P_{i}, A_{i}$) from the input image $I_{i}$. The loss ensures that 3D predictions are not degenerate and helps disentangle deformation and articulation.
  • Figure 4: (Top) Subset of the resulting cluster centers that arise from clustering the object segmentation masks. (Bottom) Representative images from each of the clusters above. We can see that our simple clustering operation captures the main viewpoint variations present in the data, e.g., left facing, frontal, right facing, etc.
  • Figure 5: Disentanglement of articulation and deformation. On the top, we interpolate the articulation latent features between a source and a target image; on the bottom, we do the same for the shape deformation features. $\lambda=1$ indicates that the original (source) features are used for reconstruction, while $\lambda=0$ indicates that the target features are used. We can see that the difference between the reconstructions is explained by articulation changes between the source and target image pairs.
  • ...and 10 more figures
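
The part-based linear blend skinning (LBS) operation $\xi$ mentioned in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is generic textbook LBS, not necessarily the exact formulation used in the paper: the function name `linear_blend_skinning`, the argument layout, and the choice to apply each part's rotation about the origin are assumptions made for illustration only.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, rotations, translations):
    """Blend per-part rigid transforms using per-vertex skinning weights.

    Hypothetical helper for illustration (generic LBS, not the paper's code).

    vertices:     (V, 3) mesh vertices, e.g. a deformed sphere
    weights:      (V, K) skinning weights; each row should sum to 1
    rotations:    (K, 3, 3) one rotation matrix per part
    translations: (K, 3) one translation per part
    """
    # Apply every part's rigid transform to every vertex -> (K, V, 3).
    per_part = np.einsum("kij,vj->kvi", rotations, vertices)
    per_part = per_part + translations[:, None, :]
    # Weighted sum over parts for each vertex -> (V, 3).
    return np.einsum("vk,kvi->vi", weights, per_part)


# Toy example: two vertices, two parts, hard (one-hot) part assignment.
verts = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
skin = np.array([[1.0, 0.0],   # vertex 0 belongs to part 0
                 [0.0, 1.0]])  # vertex 1 belongs to part 1
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
rots = np.stack([np.eye(3), rot_z_90])  # part 0 fixed, part 1 rotated 90° about z
trans = np.zeros((2, 3))

articulated = linear_blend_skinning(verts, skin, rots, trans)
```

In this toy case the first vertex stays in place because its part uses the identity transform, while the second vertex follows the 90° rotation of its part. With soft weights, a vertex near a part boundary would instead move to a weighted average of the positions given by each part's transform.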