R2Human: Real-Time 3D Human Appearance Rendering from a Single Image

Yuanwang Yang, Qiao Feng, Yu-Kun Lai, Kun Li

TL;DR

R2Human tackles the problem of real-time, photorealistic 3D human appearance rendering from a single image. It introduces a Z-map to unify implicit texture fields with explicit neural rendering, and leverages Fourier Occupancy Fields as priors to enable efficient, coherent texture generation and 3D sampling. The approach includes a pixel-aligned feature encoder that fuses FOF and normal maps, a rendering network that warps and synthesizes views, and training losses for multi-view consistency, pixel accuracy, and perceptual quality, yielding state-of-the-art results with real-time performance (28+ FPS reported on optimized hardware). These contributions advance holographic communication and VR/AR by enabling high-fidelity, monocular 3D human appearance with practical inference speed, while acknowledging privacy considerations in high-fidelity synthetic human rendering.

Abstract

Rendering 3D human appearance from a single image in real time is crucial for achieving holographic communication and immersive VR/AR. Existing methods either rely on multi-camera setups or are constrained to offline operation. In this paper, we propose R2Human, the first approach for real-time inference and rendering of photorealistic 3D human appearance from a single image. The core of our approach is to combine the strengths of implicit texture fields and explicit neural rendering through our novel representation, the Z-map. Based on this, we present an end-to-end network that performs high-fidelity color reconstruction of visible areas and provides reliable color inference for occluded regions. To further enhance the 3D perception ability of our network, we leverage the Fourier occupancy field as a prior for generating the texture field and for providing a sampling surface in the rendering stage. We also propose a consistency loss and a spatial fusion strategy to ensure multi-view coherence. Experimental results show that our method outperforms state-of-the-art methods on both synthetic data and challenging real-world images while running in real time. The project page can be found at http://cic.tju.edu.cn/faculty/likun/projects/R2Human.
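
To make the Fourier occupancy field prior mentioned in the abstract more concrete, the following is a minimal sketch of the general idea only: the inside/outside signal along a single camera ray is compressed into a few Fourier coefficients and evaluated again at any depth. It assumes a plain truncated Fourier series on a depth range normalized to [-1, 1]; the basis, normalization, and number of terms used by R2Human are not stated in this summary, and the function names (fourier_coefficients, evaluate_occupancy, make_slab) are hypothetical.

```python
import math


def fourier_coefficients(occupancy, num_terms):
    """Project occupancy samples on z in [-1, 1] onto a truncated Fourier basis.

    Assumption: standard Fourier series with period 2; not taken from the paper.
    """
    n = len(occupancy)
    dz = 2.0 / n
    # Sample positions at the centers of n equal bins covering [-1, 1].
    zs = [-1.0 + (i + 0.5) * dz for i in range(n)]
    # a0 is the integral of the signal; the series uses a0 / 2 as its mean term.
    a0 = sum(occupancy) * dz
    cos_coeffs, sin_coeffs = [], []
    for k in range(1, num_terms + 1):
        cos_coeffs.append(
            sum(f * math.cos(k * math.pi * z) for f, z in zip(occupancy, zs)) * dz
        )
        sin_coeffs.append(
            sum(f * math.sin(k * math.pi * z) for f, z in zip(occupancy, zs)) * dz
        )
    return a0, cos_coeffs, sin_coeffs


def evaluate_occupancy(a0, cos_coeffs, sin_coeffs, z):
    """Evaluate the truncated series at depth z."""
    value = a0 / 2.0
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        value += a * math.cos(k * math.pi * z) + b * math.sin(k * math.pi * z)
    return value


def make_slab(num_samples, z_near, z_far):
    """Toy ray: occupied (1.0) between z_near and z_far, empty (0.0) elsewhere."""
    dz = 2.0 / num_samples
    return [
        1.0 if z_near <= -1.0 + (i + 0.5) * dz <= z_far else 0.0
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    slab = make_slab(256, -0.3, 0.4)
    a0, cos_c, sin_c = fourier_coefficients(slab, 15)
    # Values above 0.5 can be read as "inside the body" along this ray.
    print(evaluate_occupancy(a0, cos_c, sin_c, 0.0))
    print(evaluate_occupancy(a0, cos_c, sin_c, 0.8))
```

In a pixel-aligned setting, one such coefficient vector would be stored per pixel, so the whole field can be handled as a multi-channel image; that reading is an assumption about how the prior is consumed, not a description of the authors' network.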

Paper Structure

This paper contains 19 sections, 13 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The overall pipeline of R$^2$Human for real-time 3D human appearance rendering. R$^2$Human leverages the proposed Z-map to combine the strengths of implicit texture field and explicit neural rendering seamlessly. With our consistency loss, we constrain the rendered color of the same visible point to be consistent across different views, thereby ensuring the multi-view consistency of the results.
  • Figure 2: Novel view rendering on THuman2.0 (top two rows) and the 2k2k dataset (bottom two rows).
  • Figure 3: Qualitative comparison with 3DTexture. The results show that even when 3DTexture is given four views, our single-view approach still outperforms this traditional texture-mapping method.
  • Figure 4: Ablation study comparing our model with and without the Z-map in the decoder.
  • Figure 5: Ablation study comparing our model with and without the normal map in the rendering network.
  • ...and 5 more figures
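
The consistency constraint described in the Figure 1 caption (the same visible point should receive the same color in different views) can be illustrated with a small sketch. This is only an illustration under stated assumptions: it uses a mean absolute (L1) color difference over points visible in both views, whereas the exact form of the loss in the paper is not given in this summary. The function name consistency_loss and its arguments are hypothetical.

```python
def consistency_loss(colors_a, colors_b, visible_a, visible_b):
    """Mean L1 color difference over surface points visible in both views.

    colors_a, colors_b: per-point RGB tuples rendered from view A and view B,
                        listed in the same point order.
    visible_a, visible_b: per-point booleans, True if the point is visible
                          in that view.
    Assumption: an L1 penalty; the paper's actual loss may differ.
    """
    total = 0.0
    count = 0
    for ca, cb, va, vb in zip(colors_a, colors_b, visible_a, visible_b):
        if va and vb:
            # Average the absolute difference over the color channels.
            total += sum(abs(x - y) for x, y in zip(ca, cb)) / len(ca)
            count += 1
    # With no co-visible points there is nothing to constrain.
    return total / count if count else 0.0


if __name__ == "__main__":
    colors_a = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.0, 0.0, 1.0)]
    colors_b = [(1.0, 0.0, 0.0), (0.2, 0.5, 0.5), (1.0, 1.0, 1.0)]
    visible_a = [True, True, True]
    visible_b = [True, True, False]  # third point is occluded in view B
    print(consistency_loss(colors_a, colors_b, visible_a, visible_b))
```

The visibility masks matter here: a point occluded in one view has no rendered color to compare against, so it is skipped rather than penalized.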