Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement

Chia-Hern Lai, I-Hsuan Lo, Yen-Ku Yeh, Thanh-Nguyen Truong, Ching-Chun Huang

TL;DR

Blur2Sharp tackles the challenge of photorealistic, geometrically consistent human avatar synthesis under novel poses and views from a single image. It fuses a generalizable Human NeRF with a diffusion-based generative prior, guided by multi-layer SMPL priors (texture, normal, semantic) and a reference-knowledge transfer mechanism. The method introduces a dual-domain RGB-Normal diffusion model and a Multi-Layer Geometry Fusion module to balance global structure with fine detail, enabling sharp, view-consistent outputs even for loose clothing and occlusions. Across two large datasets, Blur2Sharp achieves state-of-the-art performance on both novel pose and novel view synthesis and generalizes well, and ablations support the contribution of each component.

Abstract

The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically either yield geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: first, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Then, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.
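
The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `two_stage_synthesis`, `RefinerInput`, `render_coarse`, `smpl_priors`, and `refine` are hypothetical, images are toy nested lists, and views are handled one at a time for simplicity, whereas the paper describes a multi-view refinement network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[float]]  # toy stand-in for an H x W image


@dataclass
class RefinerInput:
    """Everything the (hypothetical) stage-2 refiner is conditioned on."""
    coarse_rgb: Image         # stage-1 rendering for one target view
    priors: Dict[str, Image]  # SMPL-derived texture / normal / semantic maps
    reference: Image          # the single input image


def two_stage_synthesis(
    reference: Image,
    target_views: List[str],
    render_coarse: Callable[[Image, str], Image],
    smpl_priors: Callable[[str], Dict[str, Image]],
    refine: Callable[[RefinerInput], Image],
) -> Dict[str, Image]:
    """Run coarse rendering, then refinement, for each target view."""
    outputs: Dict[str, Image] = {}
    for view in target_views:
        # Stage 1: geometry-aware but possibly blurry rendering.
        coarse = render_coarse(reference, view)
        # Stage 2: refinement conditioned on the coarse render, priors, reference.
        cond = RefinerInput(coarse, smpl_priors(view), reference)
        outputs[view] = refine(cond)
    return outputs


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_render(ref: Image, view: str) -> Image:
    return [[0.5 * p for p in row] for row in ref]  # placeholder "coarse" image


def toy_priors(view: str) -> Dict[str, Image]:
    flat = [[0.0, 0.0], [0.0, 0.0]]
    return {"texture": flat, "normal": flat, "semantic": flat}


def toy_refine(cond: RefinerInput) -> Image:
    # Placeholder "refinement": average the coarse render with the reference.
    return [
        [(c + r) / 2 for c, r in zip(c_row, r_row)]
        for c_row, r_row in zip(cond.coarse_rgb, cond.reference)
    ]


reference = [[1.0, 0.0], [0.0, 1.0]]
result = two_stage_synthesis(
    reference, ["front", "back"], toy_render, toy_priors, toy_refine
)
```

Passing the two stages in as callables keeps the sketch independent of any particular NeRF or diffusion library; in a real system each stand-in would be replaced by a trained model.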

Paper Structure

This paper contains 48 sections, 6 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Based on a single reference image, the proposed Blur2Sharp significantly enhances photorealism in novel pose and view synthesis. It introduces a generative refinement module conditioned on multi-level geometric information to effectively resolve the coarse and blurred artifacts typically observed in previous approaches such as SHERF, thus achieving sharper visual quality and robust geometric consistency.
  • Figure 2: System architecture. Our framework operates in two main stages: (1) a generalizable Human NeRF module (SHERF) that maps the reference image to a canonical space to generate initial novel view renderings under target body parameters $\bm{\beta}_{\text{tgt}}$ and pose parameters $\bm{\theta}_{\text{tgt}}$; and (2) a refinement network that leverages a multi-view denoising U-Net conditioned on the NeRF's RGB renderings, estimated normal maps, and additional geometric features to produce enhanced RGB and normal maps. Specifically, the MLGF module generates the geometric features by fusing the SMPL texture priors with two SMPL geometric priors (normal and semantic maps). To further preserve subject-specific details, the refinement network incorporates a Reference Knowledge Transfer Module that propagates features from the reference image and normal map via reference attention.
  • Figure 3: Qualitative comparisons of novel pose synthesis from multiple views on MVHumanNet and HuMMan, showcasing our method alongside SHERF, Animate Anyone, and Champ. Overall, our method yields more accurate poses and more consistent appearance than prior methods. Red boxes indicate enlarged regions, while blue boxes highlight areas with additional artifacts.
  • Figure 4: Qualitative results of novel view synthesis across 4 views on the MVHumanNet and HuMMan datasets. Our method shows improved realism and consistency with fewer artifacts. Red boxes indicate enlarged regions, while blue boxes highlight areas with additional artifacts.
  • Figure 5: Qualitative ablation study on the MVHumanNet dataset. $F_{geo}$ and $I_{coarse}$ denote the geometry features and the image rendered by the Human NeRF, respectively. Red boxes indicate enlarged regions.
  • ...and 12 more figures