R3-Avatar: Record and Retrieve Temporal Codebook for Reconstructing Photorealistic Human Avatars
Yifan Zhan, Wangze Xu, Qingtian Zhu, Muyao Niu, Mingze Ma, Yifei Liu, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
TL;DR
R3-Avatar introduces a record-retrieve-reconstruct framework that balances high-fidelity rendering and animatability for photorealistic human avatars. It records temporal appearance variations in a timestamp-conditioned codebook via a hex-plane spatio-temporal encoder and a 4D Gaussian decoder, then retrieves timestamps for novel poses using body-part-wise pose similarity, with temporal smoothing to keep the resulting animation coherent across frames. The approach achieves state-of-the-art performance on datasets with complex clothing, outperforming both rendering-based and animatable baselines in novel-view rendering and novel-pose animation. By decoupling non-rigid appearance from rigid motion and augmenting appearance through retrieval, it generalizes robustly to unseen poses and supports integration with 3D Gaussian Splatting pipelines for immersive AR/VR applications.
Abstract
We present R3-Avatar, which incorporates a temporal codebook to address the difficulty of making human avatars both animatable and capable of high-fidelity rendering. Existing video-based reconstruction of 3D human avatars either focuses solely on rendering and lacks animation support, or learns a pose-appearance mapping for animation, which degrades under limited training poses or complex clothing. In this paper, we adopt a "record-retrieve-reconstruct" strategy that ensures high-quality rendering from novel views while mitigating degradation in novel poses. Specifically, disambiguating timestamps record temporal appearance variations in a codebook, ensuring high-fidelity novel-view rendering, while novel poses retrieve corresponding timestamps by matching the most similar training poses for augmented appearance. Our R3-Avatar outperforms cutting-edge video-based human avatar reconstruction methods, particularly in overcoming visual quality degradation in extreme scenarios with limited training human poses and complex clothing.
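
The retrieval step described above (matching a novel pose to the most similar training poses, body part by body part, then smoothing the retrieved timestamps over time) can be illustrated with a minimal sketch. This is not the authors' implementation. The body-part grouping in `BODY_PARTS`, the Euclidean distance on per-joint rotation vectors, the moving-average smoother, and all function names are assumptions made purely for illustration.

```python
import math

# Hypothetical grouping of joint indices into body parts. This partition is an
# assumption for illustration; the summary does not specify the actual one.
BODY_PARTS = {
    "torso": [0, 1],
    "arms": [2, 3],
    "legs": [4, 5],
}


def part_distance(pose_a, pose_b, joints):
    """Euclidean distance between two poses, restricted to the given joints.

    A pose is a list of per-joint rotation vectors (tuples of floats).
    The choice of metric is an assumption, not taken from the paper.
    """
    return math.sqrt(sum(
        (a - b) ** 2
        for j in joints
        for a, b in zip(pose_a[j], pose_b[j])
    ))


def retrieve_timestamps(novel_pose, train_poses, train_times):
    """For each body part, return the timestamp of the closest training pose."""
    result = {}
    for part, joints in BODY_PARTS.items():
        best = min(
            range(len(train_poses)),
            key=lambda i: part_distance(novel_pose, train_poses[i], joints),
        )
        result[part] = train_times[best]
    return result


def smooth(values, window=3):
    """Centered moving average, with the window clipped at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def retrieve_sequence(novel_poses, train_poses, train_times, window=3):
    """Retrieve per-part timestamps for every frame, then smooth each part's track."""
    per_frame = [retrieve_timestamps(p, train_poses, train_times) for p in novel_poses]
    return {
        part: smooth([frame[part] for frame in per_frame], window)
        for part in BODY_PARTS
    }
```

Retrieving per body part rather than per whole body lets a novel pose borrow, for example, its arm appearance from one training frame and its leg appearance from another, which is one plausible reading of how retrieval could cover poses absent from the training set. The smoothing pass here simply averages neighboring timestamps; a real system might need a different scheme, since averaging can land on a timestamp whose pose matches neither neighbor.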
