Two-stage Synthetic Supervising and Multi-view Consistency Self-supervising based Animal 3D Reconstruction by Single Image

Zijian Kuang, Lihang Ying, Shi Jin, Li Cheng

TL;DR

This work tackles the challenge of 3D animal reconstruction from single images by introducing a two-stage framework that first trains on synthetic bird models using a pixel-aligned implicit function with differentiable rendering, then fine-tunes on real imagery with 2D self-supervision and silhouettes. A 2D multi-view consistency loss enforces view-invariant reconstructions, while a transfer-learning strategy adapts the model to real data. The approach yields superior results for bird 3D digitization and generalizes to other animals such as horses, cows, bears, and dogs, outperforming state-of-the-art supervised methods in both shape and texture reconstruction. The methodology leverages synthetic data to bootstrap learning and uses real-world silhouettes to bridge domain gaps, offering a practical pathway for animal 3D digitization from single-view images.

Abstract

While the Pixel-aligned Implicit Function (PIFu) effectively captures subtle variations in body shape within a low-dimensional space through extensive training on human 3D scans, applying it to live animals is challenging because it is difficult to obtain animal cooperation for 3D scanning. To address this, we propose a two-stage training scheme that combines supervised and self-supervised learning. In the first stage, we use synthetic animal models for supervised learning, which lets the model learn from a diverse set of virtual animal instances. In the second stage, we use 2D multi-view consistency as a self-supervised training signal. This further improves the model's ability to reconstruct accurate and realistic 3D shape and texture from widely available single-view images of real animals. Our results show that the approach outperforms state-of-the-art methods both quantitatively and qualitatively on bird 3D digitization. The source code is available at https://github.com/kuangzijian/drifu-for-animals.
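
To make the idea of a 2D multi-view consistency signal concrete, the following is a minimal, framework-free sketch and not the authors' implementation. It assumes the implicit function has already been sampled into a small cubic grid of inside/outside probabilities, that the "views" are the three axis-aligned orthographic projections, and that a pixel's soft silhouette value is one minus the product of the emptiness probabilities along its ray. The function names `soft_silhouette`, `silhouette_loss`, and `multi_view_consistency_loss`, as well as the mean-squared-error choice, are illustrative assumptions. A real pipeline would use an autodiff framework so that gradients flow back to the network.

```python
from math import prod


def soft_silhouette(occ, axis):
    """Project an N x N x N occupancy-probability grid to an N x N soft silhouette.

    Assumption: orthographic rays along one grid axis. A pixel is covered
    unless every voxel on its ray is empty: s = 1 - prod(1 - p).
    """
    n = len(occ)

    def ray(i, j):
        # Collect the N probabilities along the chosen axis for pixel (i, j).
        if axis == 0:
            return [occ[k][i][j] for k in range(n)]
        if axis == 1:
            return [occ[i][k][j] for k in range(n)]
        return [occ[i][j][k] for k in range(n)]

    return [[1.0 - prod(1.0 - p for p in ray(i, j)) for j in range(n)]
            for i in range(n)]


def silhouette_loss(rendered, target):
    """Mean squared error between a rendered silhouette and a target mask."""
    n = len(rendered)
    total = sum((rendered[i][j] - target[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def multi_view_consistency_loss(occ, targets):
    """Average the silhouette loss over all views.

    `targets` maps a projection axis (0, 1, or 2) to its N x N target mask.
    """
    losses = [silhouette_loss(soft_silhouette(occ, axis), mask)
              for axis, mask in targets.items()]
    return sum(losses) / len(losses)


# Toy usage: a 2 x 2 x 2 grid with a single fully occupied voxel.
occ = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
occ[0][0][0] = 1.0
mask = [[1.0, 0.0], [0.0, 0.0]]
loss = multi_view_consistency_loss(occ, {0: mask, 1: mask, 2: mask})
```

In this toy case the single occupied voxel projects to the same corner pixel in all three views, so the loss against matching masks is zero. A mask that disagrees in any one view raises the averaged loss, which is the sense in which the signal encourages reconstructions that are consistent across views.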

Paper Structure (12 sections, 1 equation, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Overview of our pipeline for digitizing models using differentiable rendering and implicit functions. In Stage 1, we use an implicit function to predict the continuous inside/outside probability field [pifu] of the synthetic bird from the given bird image. Then, using differentiable rendering, we render the 3D implicit representation of the synthetic bird, produced by a pixel-aligned feature decoder, into 2D images for multi-view self-supervised learning.
  • Figure 2: In Stage 2, we utilize a pre-trained pixel-aligned feature encoder-decoder that was trained on synthetic birds. We incorporate real bird images and their silhouettes through transfer learning.
  • Figure 3: Qualitative results showcasing single-view 3D and textured reconstructions of real bird images from the CUB-200-2011 dataset [cub].
  • Figure 4: Qualitative single-view 3D reconstruction results on real bird images from the CUB-200-2011 dataset [cub].
  • Figure 5: Qualitative single-view 3D reconstruction results on real animal images from the Weizmann horses dataset [weizmann].