Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, Ronggang Wang

TL;DR

This work presents MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model, and proposes a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction.

Abstract

3D human reconstruction from a single image is a challenging problem that has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image to facilitate reconstruction. However, these methods tend to produce artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors from multiple views) and struggle to generalize to in-the-wild images. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on the THuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

Paper Structure (20 sections, 11 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Novel view synthesis quality under sparse inputs. Our method produces superior photometric quality with faithful texture recovery in challenging regions (e.g., leaf gaps in zoomed insets and weakly-textured surfaces), enabled by more accurate geometry with sharper structural boundaries, outperforming state-of-the-art sparse-view 3DGS approaches [kerbl20233d, zhu2024fsgs, zhang2024cor, park2025dropgaussian, han2024binocular].
  • Figure 2: Framework of ICO-GS. Given sparse input views, we initialize 3D Gaussians and extract deep features. Our method enforces intrinsic geometry-appearance consistency through two synergistic components: (1) Robust Geometric Regularization. Source features are warped to reference views via rendered depth, establishing occlusion-aware multi-view constraints through: (a) Robust Multi-view Photometric Consistency that employs pixel-wise top-$k$ selection for occlusion handling, and (b) Edge-aware Depth Smoothness that preserves sharp geometric structures. (2) Geometry-Guided Appearance Optimization. We leverage geometrically reliable regions identified by (c) Cycle Consistency Depth Filtering to synthesize virtual views, then apply (d) Virtual-view Photometric Consistency between synthesized and rendered images to propagate geometric correctness into appearance learning.
  • Figure 3: Geometry-appearance discrepancy under sparse-view settings. From top to bottom: RGB on training views, depth on training views, and RGB on test views, rendered by 3D Gaussian Splatting [kerbl20233d] with varying training view densities. As the number of training views decreases, training-view appearance (top) remains well-fitted, but depth quality (middle) collapses into noise and floaters due to insufficient multi-view constraints. This geometry-appearance discrepancy leads to severe artifacts in novel-view rendering (bottom).
  • Figure 4: Visual comparison on the LLFF dataset [mildenhall2019local].
  • Figure 5: Visual comparison on the DTU dataset [jensen2014large].
  • ...and 2 more figures
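
The Figure 2 caption names two geometric regularizers without giving formulas: pixel-wise top-$k$ selection for occlusion handling and edge-aware depth smoothness. The NumPy sketch below is one plausible reading of each and is not the authors' implementation. It assumes that top-$k$ selection keeps the k smallest per-pixel errors across source views (so views with large errors, which are likely occluded, are ignored) and that smoothness follows the common form in which depth gradients are down-weighted by exp(-alpha * |image gradient|). The function names, the `alpha` parameter, and the array layouts are hypothetical.

```python
import numpy as np


def topk_photometric_loss(errors: np.ndarray, k: int) -> float:
    """Average the k smallest per-pixel errors across source views.

    errors: array of shape (S, H, W) holding the error between the
    reference view and each of S warped source views (assumed layout).
    """
    num_views = errors.shape[0]
    if not 1 <= k <= num_views:
        raise ValueError("k must be between 1 and the number of source views")
    # np.partition moves the k smallest values along axis 0 into the first k slots.
    smallest = np.partition(errors, k - 1, axis=0)[:k]
    return float(smallest.mean())


def edge_aware_smoothness(depth: np.ndarray, image: np.ndarray, alpha: float = 1.0) -> float:
    """Penalize depth gradients, with less penalty where the image has edges.

    depth: (H, W) rendered depth; image: (H, W, C) color image (assumed layout).
    """
    # Absolute finite differences of depth along x and y.
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])
    # Image gradients averaged over channels, used only as weights.
    i_dx = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    i_dy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    loss_x = (d_dx * np.exp(-alpha * i_dx)).mean()
    loss_y = (d_dy * np.exp(-alpha * i_dy)).mean()
    return float(loss_x + loss_y)
```

Under these assumptions, a depth discontinuity that coincides with an image edge costs less than the same discontinuity on a textureless surface, which is the behavior the caption describes as preserving sharp geometric structures.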