Real-Time Human Frontal View Synthesis from a Single Image

Fangyu Lin, Yingdong Hu, Lunjie Zhu, Zhening Liu, Yushi Huang, Zehong Lin, Jun Zhang

Abstract

Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.

Paper Structure (23 sections, 15 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Real-time high-fidelity human frontal view synthesis. PrismMirror achieves a $5\times$ inference speedup ($\sim$42 ms vs. 206.4 ms) over HumanRAM, while maintaining superior visual quality and reaching real-time frame rates of 24+ FPS.
  • Figure 2: Overview of the PrismMirror architecture. The framework operates through three cascaded stages: encoding global context, injecting explicit geometric priors (SMPL-X and point clouds), and decoding into NVS or 3DGS (accelerated by a progressive linear attention distillation strategy).
  • Figure 3: Visualization of geometry feature injection. By generating explicit spatial priors (point clouds and SMPL-X) inside the model, our architecture anchors purely data-driven features so that they capture precise body topology and high-frequency details.
  • Figure 4: Progressive distillation.
  • Figure 5: Qualitative comparisons on THuman2.1 and THumanSit. PrismMirror synthesizes sharper details in high-frequency regions (e.g., faces and hands) compared to baselines, successfully avoiding severe floating artifacts.
  • ...and 8 more figures
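
As a quick consistency check on the numbers quoted in the Figure 1 caption, the following back-of-the-envelope arithmetic uses only the two latencies given there. Treating the approximate $\sim$42 ms figure as exactly 42 ms is an assumption made purely for illustration:

$$
\text{speedup} \approx \frac{206.4\ \text{ms}}{42\ \text{ms}} \approx 4.9,
\qquad
\text{frame rate} \approx \frac{1000\ \text{ms/s}}{42\ \text{ms}} \approx 23.8\ \text{FPS}.
$$

Both values round to the quoted $5\times$ speedup and 24 FPS. Because the per-frame latency is itself approximate, a latency slightly below 42 ms (at most about 41.7 ms) would be needed to reach 24 FPS or more.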