Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-time Mobile Telepresence

Yonggan Fu, Yuecheng Li, Chenghui Li, Jason Saragih, Peizhao Zhang, Xiaoliang Dai, Yingyan Celine Lin

TL;DR

Auto-CARD tackles the on-device encoding bottleneck of Codec Avatars for AR/VR by reducing architectural and temporal redundancies. It introduces AVE-NAS, a hardware-aware neural architecture search tailored for avatar encoders, and LATEX, a latency-aware latent extrapolation scheme that skips redundant frames. On real devices, the method achieves up to a 5.05× speed-up with rendering quality comparable to or better than state-of-the-art encoders, while LATEX provides additional frame skipping with minimal quality loss. Together, AVE-NAS and LATEX offer a practical path toward real-time, privacy-preserving photorealistic telepresence on mainstream AR/VR hardware.

Abstract

Real-time, robust photorealistic avatars are highly desired for immersive telepresence in AR/VR. However, one key bottleneck remains: the considerable computational expense of accurately inferring facial expressions from headset-mounted camera images at a quality that matches the realism of the avatar's appearance. To this end, we propose Auto-CARD, a framework that for the first time enables real-time and robust driving of Codec Avatars using only on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique, AVE-NAS, for avatar encoding in AR/VR, which explicitly boosts both the robustness of the searched architectures under extreme facial expressions and their hardware friendliness on fast-evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism, dubbed LATEX, to skip the computation of redundant frames. Specifically, we first identify an opportunity arising from the linearity of the latent space derived by the avatar decoder and then propose adaptive latent extrapolation for redundant frames. In evaluation, we demonstrate the efficacy of Auto-CARD in real-time Codec Avatar driving settings, achieving a 5.05× speed-up on Meta Quest 2 while maintaining animation quality comparable to or even better than state-of-the-art avatar encoder designs.
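The frame-skipping idea in the abstract can be illustrated with a minimal sketch. It assumes the latent code changes roughly linearly over a few frames and uses a fixed skip interval in place of the adaptive, latency-aware schedule of LATEX, whose details are not given here. The names `extrapolate`, `drive`, `encode`, and `skip` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Latent = List[float]


def extrapolate(z_a: Sequence[float], z_b: Sequence[float],
                gap: int, steps: int) -> Latent:
    """Linearly extrapolate `steps` frames past z_b, given that z_a and z_b
    were encoded `gap` frames apart."""
    return [b + (b - a) * steps / gap for a, b in zip(z_a, z_b)]


def drive(frames: Iterable, encode: Callable[[object], Sequence[float]],
          skip: int = 1) -> List[Latent]:
    """Return one latent code per frame, running the encoder only on some frames.

    Assumption: a fixed schedule that skips up to `skip` frames after each
    encoded frame. The first two frames are always encoded, because
    extrapolation needs two anchor points.
    """
    latents: List[Latent] = []
    anchors: List[Tuple[int, Latent]] = []  # (frame index, encoded latent)
    for i, frame in enumerate(frames):
        if len(anchors) < 2 or i - anchors[-1][0] > skip:
            # Run the (expensive) encoder and record a new anchor.
            z = list(encode(frame))
            anchors.append((i, z))
        else:
            # Skip the encoder and extrapolate from the last two anchors.
            (ia, za), (ib, zb) = anchors[-2], anchors[-1]
            z = extrapolate(za, zb, gap=ib - ia, steps=i - ib)
        latents.append(z)
    return latents
```

With `skip=1`, six input frames trigger the encoder on frames 0, 1, 3, and 5 only; the remaining latents come from extrapolation and are exact whenever the true latent trajectory is linear in the frame index.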

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An overview of our Auto-CARD framework, integrating the proposed (a) AVE-NAS and (b) LATEX techniques for minimizing the model and temporal redundancy, respectively.
  • Figure 2: Visualization of the avatars rendered by the encoders searched without and with our proposed objectives, with the normalized FLOPs distribution across the three views annotated.
  • Figure 3: Visualization of the rendered avatar (in the middle) decoded from the latent code $z_t$ obtained by interpolating between $z_0$ and $z_T$.
  • Figure 4: Benchmark of the rendering quality achieved by our searched AVE-M against the SOTA encoder EEM [Gabriel20] (zoom in for a better view).
  • Figure 5: Visualization of the expressions rendered by the encoders searched without and with our proposed objective.
  • ...and 1 more figure
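
The interpolation mentioned in the Figure 3 caption can be written out under the assumption that it is plain linear interpolation in the latent space, which the caption itself does not specify. For a frame index $t$ with $0 \le t \le T$:

$$z_t = \left(1 - \frac{t}{T}\right) z_0 + \frac{t}{T}\, z_T$$

Under the same linearity assumption, letting $t$ exceed $T$ in this expression gives an extrapolated latent code rather than an interpolated one.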