Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Remy Sabathier, Niloy J. Mitra, David Novotny

TL;DR

This work significantly improves the quality of template-based shape fitting by endowing the SMAL parametric model with Continuous Surface Embeddings, which bring image-to-mesh reprojection constraints that are denser, and thus stronger, than the previously used sparse semantic keypoint correspondences.

Abstract

We present a method to build animatable dog avatars from monocular videos. This is challenging as animals display a range of (unpredictable) non-rigid movements and have a variety of appearance details (e.g., fur, spots, tails). We develop an approach that links the video frames via a 4D solution that jointly solves for the animal's pose variation and its appearance (in a canonical pose). To this end, we significantly improve the quality of template-based shape fitting by endowing the SMAL parametric model with Continuous Surface Embeddings, which bring image-to-mesh reprojection constraints that are denser, and thus stronger, than the previously used sparse semantic keypoint correspondences. To model appearance, we propose an implicit duplex-mesh texture that is defined in the canonical pose, but can be deformed using SMAL pose coefficients and later rendered to enforce photometric compatibility with the input video frames. On the challenging CoP3D and APTv2 datasets, we demonstrate superior results (both in terms of pose estimates and predicted appearance) to existing template-free (RAC) and template-based approaches (BARC, BITE).

Paper Structure (39 sections, 9 equations, 9 figures, 3 tables)

This paper contains 39 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Animal Avatars. Given a monocular video of a dog, we propose a template-based method to reconstruct the shape $\beta$, motion $\theta_{t}$ and texture $\psi$. We address the challenge of insufficient supervision for unconventional views by integrating Continuous Surface Embeddings with an articulated mesh. We introduce a novel implicit duplex-mesh texture model, jointly optimized alongside motion parameters.
  • Figure 1: Texture reconstruction from BARC and BITE pose predictions. We optimize our duplex-mesh renderer model with a fixed set of poses $\{\theta_{t}\}$ estimated by BARC and BITE on sequences from CoP3D.
  • Figure 2: Method overview. A two-stage process: first, we initialize the root pose $g_{t}^{0}$ via PnP-RANSAC, utilizing CSE mesh-pixel correspondences. Then, we jointly optimize shape $\beta$, time-varying pose $\theta_{t}$ and implicit texture $\psi$ through an analysis-by-synthesis approach, leveraging mask $\mathcal{L}^{mask}$, dense correspondence $\mathcal{L}^{cse}$ (optimized models in orange), and photometric $\mathcal{L}^{photo}$ signals.
  • Figure 2: Additional qualitative comparison.
  • Figure 3: SMAL CSE. We align the CSE mesh template (top left) with the SMAL template (top right) in order to set up the CSE coordinate frame over the surface of the SMAL mesh. In combination with a pretrained image-to-CSE predictor, this allows establishing dense correspondences between surface points of the SMAL template and the corresponding image pixels. Note that the image CSE detections (rows labelled "CSE") provide dense correspondences covering all parts of the dog body, which is not the case for sparse keypoints ("Keypoints" rows).
  • ...and 4 more figures
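
The captions of Figures 2 and 3 describe dense mesh-pixel correspondences obtained from Continuous Surface Embeddings and a correspondence loss $\mathcal{L}^{cse}$. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each pixel and each mesh vertex carries an embedding vector, that a pixel is matched to the vertex with the highest cosine similarity, and that the loss is the mean squared reprojection error under a simple pinhole camera. All function names, the matching rule, and the camera model are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_pixels_to_vertices(pixel_embeddings, vertex_embeddings):
    """For each pixel embedding, return the index of the most similar vertex.

    Assumption: nearest neighbour in embedding space defines the
    pixel-to-mesh correspondence.
    """
    matches = []
    for emb in pixel_embeddings:
        sims = [cosine(emb, v) for v in vertex_embeddings]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


def project(point, focal=1.0):
    """Pinhole projection of a 3D camera-space point (assumed camera model)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def reprojection_loss(pixels, matches, vertices, focal=1.0):
    """Mean squared distance between each pixel and its projected matched vertex."""
    total = 0.0
    for (u, v), idx in zip(pixels, matches):
        pu, pv = project(vertices[idx], focal)
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total / len(pixels)


# Toy data: three mesh vertices with 2D embeddings, two observed pixels.
vertex_embeddings = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
vertices = [(0.5, 0.0, 2.0), (0.0, 0.5, 2.0), (-0.5, 0.0, 2.0)]
pixels = [(0.25, 0.0), (0.0, 0.25)]
pixel_embeddings = [(0.9, 0.1), (0.1, 0.9)]

matches = match_pixels_to_vertices(pixel_embeddings, vertex_embeddings)
loss = reprojection_loss(pixels, matches, vertices)
```

In this sketch every foreground pixel contributes one term to the loss, whereas a sparse keypoint detector contributes only a handful of terms per frame. This is the sense in which the dense constraints are described as stronger.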