Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

Abstract

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to its limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expression and finger-level articulation control, with strong identity preservation. Notably, despite the absence of direct supervision, we observe emergent relightability and loose-garment support on unconstrained inputs, as well as zero-shot robustness to stylized imagery.

Figures (10)

  • Figure 1: Large-scale Codec Avatars (LCA). (Left) We generate avatars from a handful of images in seconds. LCA follows the pre/post-training paradigm, achieving broad generalization together with higher-fidelity reconstruction than pretraining alone. (Middle) The resulting avatars are highly detailed with faithful 3D structure and are fully animatable with expression, gaze, and body pose, even for out-of-domain samples. (Right) LCA further supports loose garments and relighting while retaining generalizability by modifying only the post-training stage.
  • Figure 2: (Left) Overview. Given multiple images of a subject, we extract image tokens from full-body images and face crops, and geometric tokens from a template mesh. The LCA encoder alternates image-only, geometry-only, and multimodal attention to fuse information across streams. Our decoders, canonical and pose-dependent, predict Gaussian attributes, which are skinned via linear blend skinning (LBS; see the sketch after this list) and rendered to novel views. Training uses photometric reconstruction losses. (Right) Pretraining vs. Post-Training. LCA pretrains on large-scale, unconstrained monocular videos of single subjects with mixed (mid/low) quality, then post-trains on high-quality, multi-view studio captures. Pretraining drives broad generalization, whereas post-training improves fidelity and 3D completeness.
  • Figure 3: Node-Based Deformation Model. We use a flexible, two-level learnable deformation model to adapt skinning-weight learning during post-training.
  • Figure 4: Pretraining vs. Post-Training. Qualitative comparison of models trained on multiple data sources and training strategies.
  • Figure 5: Qualitative Comparison with State-of-the-Art Methods. LCA outperforms prior methods in both multi-view and monocular settings.
  • ...and 5 more figures
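
The Figure 2 and Figure 3 captions describe the core posing step: canonical Gaussian attributes are deformed via linear blend skinning (LBS), with skinning weights that are learnable during post-training. The sketch below illustrates that mechanism under simplified assumptions; all names and tensor shapes, and the softmax parameterization of the learnable weights, are hypothetical illustrations of standard LBS, not the authors' implementation.

```python
# Minimal illustrative sketch of linear blend skinning (LBS) applied to
# Gaussian centers, per the Figure 2 caption. Names/shapes are hypothetical.
import torch

def lbs_transform(
    canonical_means: torch.Tensor,   # (N, 3) Gaussian centers in canonical space
    blend_weights: torch.Tensor,     # (N, J) per-Gaussian skinning weights, rows sum to 1
    joint_transforms: torch.Tensor,  # (J, 4, 4) rigid transform per joint
) -> torch.Tensor:
    """Blend per-joint rigid transforms and apply them to each Gaussian center."""
    # Weighted sum of joint transforms per Gaussian: (N, 4, 4)
    blended = torch.einsum("nj,jab->nab", blend_weights, joint_transforms)
    # Lift points to homogeneous coordinates: (N, 4)
    ones = torch.ones(canonical_means.shape[0], 1, device=canonical_means.device)
    homog = torch.cat([canonical_means, ones], dim=-1)
    # Apply each point's blended transform and drop the homogeneous row: (N, 3)
    return torch.einsum("nab,nb->na", blended, homog)[:, :3]

# Figure 3 describes learnable skinning weights; one common way to keep
# learned weights valid (non-negative, summing to 1) is a softmax over logits:
N, J = 10_000, 24                                  # hypothetical sizes
weight_logits = torch.randn(N, J, requires_grad=True)
blend_weights = torch.softmax(weight_logits, dim=-1)

# Sanity check: identity joint transforms leave the canonical points unchanged.
means = torch.randn(N, 3)
identity = torch.eye(4).expand(J, 4, 4)
assert torch.allclose(lbs_transform(means, blend_weights, identity), means, atol=1e-5)
```

Because the blended transform is differentiable in both the logits and the joint transforms, photometric reconstruction losses (as in the Figure 2 caption) can back-propagate into the skinning weights during post-training.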