4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

Jin Lyu, Liang An, Pujin Cheng, Yebin Liu, Xiaoying Tang

TL;DR

This work proposes 4DEquine, a framework that disentangles 4D reconstruction into two sub-problems, dynamic motion reconstruction and static appearance reconstruction, and introduces a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth, pixel-aligned pose and shape sequences from video.

Abstract

4D reconstruction of the equine family (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observations. In this work, we propose a novel framework called 4DEquine, which disentangles the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. Although trained only on synthetic datasets, 4DEquine achieves state-of-the-art performance on the real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks. Project page: https://luoxue-star.github.io/4DEquine_Project_Page/.
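The disentanglement described in the abstract can be illustrated with a toy sketch: a static canonical avatar is built once, a motion sequence is estimated separately, and the two are combined per frame. This is not the authors' implementation. The names `FramePose`, `pose_points`, and `animate` are hypothetical, and the motion is simplified to a rigid yaw rotation plus translation, whereas the paper regresses articulated pose and shape sequences.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class FramePose:
    """Hypothetical per-frame motion: yaw about the y axis plus a translation.

    Assumption for illustration only; the paper uses articulated pose and shape.
    """
    yaw: float
    translation: Point


def pose_points(canonical: List[Point], pose: FramePose) -> List[Point]:
    """Apply one frame's rigid motion to the static canonical points."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    tx, ty, tz = pose.translation
    return [
        (c * x + s * z + tx, y + ty, -s * x + c * z + tz)
        for x, y, z in canonical
    ]


def animate(canonical: List[Point], motion: List[FramePose]) -> List[List[Point]]:
    """Combine a static appearance (built once) with a per-frame motion sequence."""
    return [pose_points(canonical, pose) for pose in motion]


# Static part: stand-in for the centers of a canonical 3D Gaussian avatar.
canonical_centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Dynamic part: stand-in for a motion sequence regressed from video.
motion = [
    FramePose(0.0, (0.0, 0.0, 0.0)),
    FramePose(math.pi / 2, (0.0, 0.0, 1.0)),
]

frames = animate(canonical_centers, motion)
```

Because the canonical points are never modified, the appearance could in principle come from a single image while the motion comes from the whole video, which is the separation the abstract argues avoids joint optimization over both.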

Paper Structure

This paper contains 36 sections, 5 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Two samples of the VarenPoser dataset. Each sample is a video, and we only show two frames for each sample.
  • Figure 2: Overview of 4DEquine. (a) AniMoFormer: A spatio-temporal transformer with post-optimization for motion recovery. (b) EquineGS: A feed-forward network to reconstruct a canonical 3D Gaussian avatar from a single image. (c) DSTG-Block: The dual-stream architecture used in EquineGS.
  • Figure 3: VarenTex generation pipeline. Normal and canonical coordinate maps (CCM) rendered from VarenPoser meshes, alongside a ControlNet-generated reference image, are fed into UniTex to synthesize multi-view training images.
  • Figure 4: Qualitative comparison with the SOTA methods on the AiM dataset. GART$^{*}$: Few-shot GART. "Input" here is the middle frame of each test video clip. Note that the input image for EquineGS is the first image in the video; therefore, the results of "Ours" shown in this figure correspond to novel-pose animation.
  • Figure 5: Zero-shot generalization of 4DEquine to Internet images of an unseen species (donkey). "GT" is also the input image.
  • ...and 6 more figures