R3-Avatar: Record and Retrieve Temporal Codebook for Reconstructing Photorealistic Human Avatars

Yifan Zhan, Wangze Xu, Qingtian Zhu, Muyao Niu, Mingze Ma, Yifei Liu, Zhihang Zhong, Xiao Sun, Yinqiang Zheng

TL;DR

R3-Avatar introduces a record-retrieve-reconstruct framework to balance high-fidelity rendering and animatability for photorealistic human avatars. It records temporal appearance variations in a timestamp-conditioned codebook via a hex-plane spatio-temporal encoder and a 4D Gaussian decoder, then retrieves timestamps for novel poses using body-part-wise pose similarity and temporal smoothing to enable smooth animation. The approach achieves state-of-the-art performance on datasets with complex clothing, outperforming both rendering-based and animatable baselines in novel-view rendering and novel-pose animation. By decoupling non-rigid appearance from rigid motion and using a retrieval-based augmentation of appearance, it enables robust generalization to unseen poses and supports integration with 3D Gaussian Splatting pipelines for immersive AR/VR applications.

Abstract

We present R3-Avatar, incorporating a temporal codebook, to overcome the inability of human avatars to be both animatable and of high-fidelity rendering quality. Existing video-based reconstruction of 3D human avatars either focuses solely on rendering, lacking animation support, or learns a pose-appearance mapping for animating, which degrades under limited training poses or complex clothing. In this paper, we adopt a "record-retrieve-reconstruct" strategy that ensures high-quality rendering from novel views while mitigating degradation in novel poses. Specifically, disambiguating timestamps record temporal appearance variations in a codebook, ensuring high-fidelity novel-view rendering, while novel poses retrieve corresponding timestamps by matching the most similar training poses for augmented appearance. Our R3-Avatar outperforms cutting-edge video-based human avatar reconstruction, particularly in overcoming visual quality degradation in extreme scenarios with limited training human poses and complex clothing.
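
The retrieval step described above (a novel pose looks up the timestamp of its most similar training pose, per body part, with temporal smoothing) can be illustrated with a minimal sketch. This is not the authors' implementation: the joint grouping in `BODY_PARTS`, the Euclidean pose distance, and the linear smoothing penalty `smooth_weight` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math

# Hypothetical grouping of joint indices into body parts (assumption, not from the paper).
BODY_PARTS = {
    "torso": [0, 1],
    "arms": [2, 3],
    "legs": [4, 5],
}


def part_distance(pose_a, pose_b, joints):
    """Mean Euclidean distance between two poses, restricted to the given joints."""
    return sum(math.dist(pose_a[j], pose_b[j]) for j in joints) / len(joints)


def retrieve_timestamps(query_pose, train_poses, prev=None, smooth_weight=0.0):
    """For each body part, pick the training frame index closest to the query pose.

    If `prev` (the previous frame's retrieved indices) is given, a penalty
    proportional to the jump in frame index discourages abrupt switches.
    This penalty is an assumed stand-in for the paper's temporal smoothing.
    """
    result = {}
    for part, joints in BODY_PARTS.items():
        best_t, best_cost = None, math.inf
        for t, train_pose in enumerate(train_poses):
            cost = part_distance(query_pose, train_pose, joints)
            if prev is not None:
                cost += smooth_weight * abs(t - prev[part])
            if cost < best_cost:
                best_t, best_cost = t, cost
        result[part] = best_t
    return result


def retrieve_sequence(query_poses, train_poses, smooth_weight=0.1):
    """Retrieve per-part timestamps for a whole novel-pose sequence, frame by frame."""
    prev, out = None, []
    for query_pose in query_poses:
        prev = retrieve_timestamps(query_pose, train_poses, prev, smooth_weight)
        out.append(prev)
    return out
```

Each pose is assumed to be a list of per-joint 3-vectors (for example, axis-angle rotations). The retrieved indices would then be used to query the temporal codebook; how the per-part timestamps are combined into one appearance is not covered by this sketch.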

Paper Structure

This paper contains 30 sections, 5 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: (a) A case showing appearance ambiguity. The skirt of a human in the same pose looks different under different motion patterns (fall and jump). (b) Our disambiguating "record-retrieve-reconstruct" strategy compared to the naive pose-input training pipeline, which is prone to appearance ambiguity. (c) A glimpse of the rendering and animation results.
  • Figure 2: The pipeline of $R^3$-Avatar. In the "recording phase" (subsections "Multi-plane Spatio-temporal Encoder" and "4D Gaussian Decoder"), a temporal codebook is used to capture human appearance variations over time. In the "retrieving phase" (subsection "Appearance Retrieving for Rendering"), retrieving the temporal codebook enables high-fidelity novel-view rendering and novel-pose animation.
  • Figure 3: Novel-view rendering of our method and other baselines. We compare on the DNA-Rendering dataset [cheng2023dna] (first and second rows), the ZJU_MoCap dataset [peng2021neural] (third row), and the MVHumanNet dataset [xiong2024mvhumannet] (last row).
  • Figure 4: Novel-pose animation of our method and other baselines on the DNA-Rendering dataset [cheng2023dna].
  • Figure 5: Novel-view rendering on the HiFi4G dataset [jiang2024hifi4g].
  • ...and 13 more figures