OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli

TL;DR

Occluded humans in monocular videos present a major challenge for 3D rendering. OccFusion combines 3D Gaussian splatting with pretrained 2D diffusion priors in a three-stage pipeline (Initialization, Optimization with Score Distillation Sampling, and Refinement with in-context inpainting) to recover complete geometry and faithful appearance under occlusion. It introduces OccGauHuman, a streamlined GauHuman variant tailored for occlusion handling, and leverages diffusion priors to enforce geometry completeness in both posed and canonical spaces, achieving state-of-the-art performance on ZJU-MoCap and OcMotion with only about 10 minutes of training. The result is an efficient approach to novel-view synthesis of occluded humans from monocular video.

Abstract

Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.
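
The SDS supervision mentioned for the Optimization stage can be illustrated with a toy sketch. It follows the generic score-distillation gradient (a weighted difference between predicted and injected noise), not the paper's pose-conditioned formulation, which is not reproduced here. All names (`sds_gradient`, `make_oracle_denoiser`, `toy_optimize`) and the scalar noise level `alpha_bar` are hypothetical, and the "denoiser" is a stand-in for a frozen diffusion model that simply knows a target image.

```python
import math
import random


def make_oracle_denoiser(target):
    """Stand-in for a frozen diffusion model (assumption, not a real model).

    Given a noisy image, it returns the noise that would have produced it
    if the clean image were `target`.
    """
    def predict_noise(noisy, alpha_bar):
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(x_t - a * x0) / s for x_t, x0 in zip(noisy, target)]
    return predict_noise


def sds_gradient(rendered, predict_noise, alpha_bar, weight, rng):
    """One-sample SDS gradient with respect to the rendered pixels.

    rendered: flattened image as a list of floats.
    The gradient is weight * (predicted_noise - injected_noise); in a real
    pipeline it would be back-propagated through the renderer.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in rendered]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * x + s * e for x, e in zip(rendered, noise)]
    predicted = predict_noise(noisy, alpha_bar)
    return [weight * (p - e) for p, e in zip(predicted, noise)]


def toy_optimize(rendered, target, steps=20, lr=0.5, seed=0):
    """Plain gradient descent on the pixels using the SDS gradient."""
    rng = random.Random(seed)
    denoiser = make_oracle_denoiser(target)
    x = list(rendered)
    for _ in range(steps):
        grad = sds_gradient(x, denoiser, alpha_bar=0.5, weight=1.0, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```

With the oracle denoiser the injected noise cancels, so each step pulls the pixels toward the target. A real diffusion prior gives a noisy, sample-dependent direction instead.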

Paper Structure (35 sections, 6 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Reconstructing humans from monocular videos frequently fails under occlusion. In this paper, we introduce OccFusion, a method that combines 3D Gaussian splatting with 2D diffusion priors for modeling occluded humans. Our method outperforms the state-of-the-art in rendering quality and efficiency, resulting in clean and complete renderings free of artifacts.
  • Figure 2: OccFusion achieves occluded human rendering via three sequential stages. In the Initialization Stage, we recover complete binary human masks $\{\mathbf{\hat{M}}\}$ from occluded partial observations $\{\mathbf{I}\}$ with the help of segmentation priors $\{\mathbf{M}\}$ and pose priors $\{\mathbf{P}\}$. $\{\mathbf{\hat{M}}\}$ will be further used to help optimize human Gaussians $\Pi$ in subsequent stages. In the Optimization Stage, we apply $\{\mathbf{P}\}$ conditioned SDS on both posed human and canonical human to enforce the human occupancy to remain complete. In the Refinement Stage, we use the coarse human renderings $\{\mathbf{\hat{I}}\}$ from the Optimization Stage to help generate missing RGB values in $\{\mathbf{I}\}$ through our proposed in-context inpainting. Through this process, both the appearance and geometry of the human are fine-tuned to be in high fidelity. Training of all three stages takes only 10 minutes on a single Titan RTX GPU.
  • Figure 3: Stable Diffusion 1.5 generations conditioned on a challenging pose $\mathbf{P}$. While conditioning on the original pose results in multiple limbs and other abnormalities, our method of simplifying the pose by removing self-occluded joints results in more feasible generations.
  • Figure 4: While generative models provide inconsistent inpainting results, the binary masks that can be extracted from these generated images are much more consistent.
  • Figure 5: Qualitative comparisons on simulated occlusions in the ZJU-MoCap dataset [peng2021neural] (left column) and real-world occlusions in the OcMotion dataset [huang2022object] (right column). ON denotes OccNeRF [occnerf] and OGH denotes OccGauHuman.
  • ...and 4 more figures
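
The observation in Figure 4, that binary masks extracted from differing generations agree far more than the generations themselves, can be made concrete with a small sketch. It is an illustration only: thresholding stands in for whatever mask extraction the pipeline actually uses, majority voting is one possible way to merge masks and is not claimed to be the paper's, and the helper names (`binarize`, `iou`, `mean_pairwise_iou`, `majority_vote`) are hypothetical.

```python
from itertools import combinations


def binarize(image, threshold=0.5):
    """Turn a 2D grid of intensities into a 0/1 mask (stand-in extractor)."""
    return [[1 if px > threshold else 0 for px in row] for row in image]


def iou(a, b):
    """Intersection over union of two equally sized binary masks."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0


def mean_pairwise_iou(masks):
    """Average IoU over all mask pairs; higher means more consistent."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def majority_vote(masks):
    """Keep a pixel if more than half of the masks mark it as foreground."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in masks) > n else 0 for c in range(width)]
        for r in range(height)
    ]
```

For example, three generations whose pixel intensities differ but whose silhouettes mostly coincide yield a high mean pairwise IoU, and the voted mask discards a region that only a single sample fills.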