3D-Consistent Human Avatars with Sparse Inputs via Gaussian Splatting and Contrastive Learning

Haoyu Zhao, Hao Wang, Chen Yang, Wei Shen

TL;DR

This work proposes CHASE, a novel framework that achieves dense-input-level performance using only sparse inputs through two key innovations: cross-pose intrinsic 3D consistency supervision and 3D geometry contrastive learning.

Abstract

Existing approaches for human avatar generation--both NeRF-based and 3D Gaussian Splatting (3DGS) based--struggle with maintaining 3D consistency and exhibit degraded detail reconstruction, particularly when training with sparse inputs. To address this challenge, we propose CHASE, a novel framework that achieves dense-input-level performance using only sparse inputs through two key innovations: cross-pose intrinsic 3D consistency supervision and 3D geometry contrastive learning. Building upon prior skeleton-driven approaches that combine rigid deformation with non-rigid cloth dynamics, we first establish baseline avatars with fundamental 3D consistency. To enhance 3D consistency under sparse inputs, we introduce a Dynamic Avatar Adjustment (DAA) module, which refines deformed Gaussians by leveraging similar poses from the training set. By minimizing the rendering discrepancy between adjusted Gaussians and reference poses, DAA provides additional supervision for avatar reconstruction. We further maintain global 3D consistency through a novel geometry-aware contrastive learning strategy. While designed for sparse inputs, CHASE surpasses state-of-the-art methods across both full and sparse settings on ZJU-MoCap and H36M datasets, demonstrating that our enhanced 3D consistency leads to superior rendering quality.

Paper Structure (19 sections, 8 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: CHASE. We propose an efficient method for creating 3D-consistent animatable avatars from videos alone. Our method achieves better quality than the most recent SOTA methods [wen2024gomavatar, hu2024gauhuman, qian20243dgs] with both full and sparse inputs.
  • Figure 2: CHASE Framework. We first initialize 3D Gaussians in canonical space by randomly sampling 50k points on the SMPL mesh surface. Then, we combine rigid human articulation with a non-rigid deformation neural field to deform the 3D Gaussians from canonical space ${\mathcal{G}_c}$ to observation space ${\mathcal{G}_o}$. Next, we select similar poses/images from the dataset for each training pose/image and adjust the deformed Gaussians ${\mathcal{G}_o}$ to the similar pose ${\mathcal{G}_a}$ using Dynamic Avatar Adjustment (DAA). Minimizing the difference between the rendering of the adjusted Gaussians ${\mathcal{G}_a}$ and the selected similar images $x_i^a$ serves as additional supervision. Furthermore, we propose a 3D geometry contrastive learning strategy, which compares features from a 3D feature extractor to improve the avatar’s global 3D consistency. Negative pairs consist of the features of the deformed Gaussians ${\mathcal{G}_o}$ and the adjusted Gaussians ${\mathcal{G}_a}$. Positive pairs consist of the features of ${\mathcal{G}_a}$ and ${\mathcal{G}_o'}$, which is deformed from the canonical space to match the pose adjustments seen in ${\mathcal{G}_a}$.
  • Figure 3: For each training pose/image, we select similar poses/images from the dataset and then adjust the deformed Gaussians using DAA. By minimizing the difference between the rendered image of the adjusted avatar and the selected similar pose image, we introduce additional supervision, thereby refining the creation of photo-realistic and animatable avatars.
  • Figure 4: Qualitative Comparison on ZJU-MoCap [peng2020neural]. We present results for full and sparse inputs (5% of the full inputs) on the ZJU-MoCap dataset. The results show that CHASE produces realistic details with both full and sparse inputs, while other approaches struggle to generate smooth details.
  • Figure 5: Qualitative Comparison on H36M [ionescu2013human3] with sparse inputs. Our method produces realistic details for novel poses in both rendered images and geometry, whereas other approaches struggle to achieve smooth details.
  • ...and 2 more figures
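
The pairing described in the Figure 2 caption can be illustrated with a small contrastive objective. The sketch below assumes an InfoNCE-style loss over cosine similarities with a single positive pair (features of ${\mathcal{G}_a}$ and ${\mathcal{G}_o'}$) and a single negative pair (features of ${\mathcal{G}_a}$ and ${\mathcal{G}_o}$). The loss form, the temperature value, and the function names are assumptions made for illustration; the exact loss used by the authors is not given in this excerpt.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def contrastive_loss(
    feat_adjusted: Sequence[float],        # features of G_a (anchor)
    feat_redeformed: Sequence[float],      # features of G_o' (positive)
    feat_deformed: Sequence[float],        # features of G_o (negative)
    temperature: float = 0.1,              # assumed value, not from the paper
) -> float:
    """InfoNCE-style loss with one positive and one negative (assumed form).

    loss = -log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) )
    where s_pos and s_neg are temperature-scaled cosine similarities.
    """
    s_pos = cosine_similarity(feat_adjusted, feat_redeformed) / temperature
    s_neg = cosine_similarity(feat_adjusted, feat_deformed) / temperature
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


# Toy features: the positive matches the anchor, the negative is orthogonal.
anchor = [1.0, 0.0]
low_loss = contrastive_loss(anchor, [1.0, 0.0], [0.0, 1.0])
high_loss = contrastive_loss(anchor, [0.0, 1.0], [1.0, 0.0])
```

Under this assumed form, the loss is small when the features of ${\mathcal{G}_a}$ agree with those of ${\mathcal{G}_o'}$ more than with those of ${\mathcal{G}_o}$, and grows when the relationship is reversed.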