ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Mingzhi Zhu, Ding Shang, Sai Qian Zhang

TL;DR

ESCA tackles the challenge of real-time, photorealistic Codec Avatars on resource-constrained VR devices by jointly optimizing low-bit quantization and hardware acceleration. It introduces four tightly integrated components (ICAS, FFAS, UV-Weighted PTQ, and an input-combining PCA accelerator) to preserve facial detail while enabling 4-/8-bit decoding at high throughput. Experimental results show FovVideoVDP gains of up to +0.39 over the best 4-bit baselines, a decoder latency reduction of up to 3.36×, and end-to-end rendering at 100 FPS, validating real-time edge deployment. This full-stack co-design provides a practical pathway to immersive, portable VR experiences with high-fidelity avatar rendering.

Abstract

Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
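To make the low-precision trade-off in the abstract concrete, the following is a minimal sketch of generic uniform symmetric post-training quantization at 4 and 8 bits. It is not the paper's PTQ method: the function names, the per-tensor scale, and the sample weights are all assumptions made for illustration. It only shows why 4-bit execution is harder than 8-bit when a tensor contains an outlier.

```python
def quantize_symmetric(values, num_bits):
    """Generic uniform symmetric quantization (illustrative, not the paper's method).

    Maps floats to signed integer codes with one per-tensor scale,
    then maps the codes back to floats so the error can be measured.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale


def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights; the value 1.2 acts as an outlier that stretches the scale.
weights = [0.02, -0.15, 0.4, -0.33, 0.07, 1.2]

codes4, deq4, scale4 = quantize_symmetric(weights, 4)
codes8, deq8, scale8 = quantize_symmetric(weights, 8)

err4 = mean_squared_error(weights, deq4)
err8 = mean_squared_error(weights, deq8)
print(f"4-bit MSE: {err4:.6f}, 8-bit MSE: {err8:.6f}")
```

In this toy case the single outlier sets the scale, so the small values collapse onto very few 4-bit codes. Reducing that effect is the general motivation for quantization schemes tailored to a specific model.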

Paper Structure

This paper contains 21 sections, 26 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The left panel shows the maximum activation value of each channel of a sample input. The right panel shows the aggregated activation distribution over all spatial locations and channels.
  • Figure 2: VAE Framework of Codec Avatar models.
  • Figure 3: (a) Execution pipeline of the entire Codec Avatar system. (b) Architecture of normalized VR headset SoC. (c) Illustration of transposed convolution. Purple squares represent non-zero activation, and white squares represent zero activation.
  • Figure 4: Convert the Codec Avatar decoder to a quantized model. The pipeline consists of three main components: (a) The original Codec Avatar decoder, (b) the UV-Weighted Post-Training Quantization, and (c) the decoder layer after Input Channel-wise Activation Smoothing. The smooth operation is designed to reduce the difficulty of quantizing activations, while the UV-PTQ method uses a UV weight map to guide the quantization process. Together, these techniques enable efficient and accurate quantization of the Codec Avatar decoder for real-time inference on VR headsets.
  • Figure 5: (a) Architecture of the proposed hardware accelerator for Codec Avatar inference. (b) Input-combining tiling scheme applied to the activation matrix. Red lines partition the input activation into smaller tiles. Purple/white squares denote non-zero/zero activations, and yellow squares are zero but can be assumed non-zero for simplicity. (c) Internal architecture of the proposed PE.
  • ...and 1 more figure
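The captions of Figures 1 and 4 refer to per-channel activation maxima and to a smoothing step that makes activations easier to quantize. The details of the paper's Input Channel-wise Activation Smoothing are not given in this section, so the sketch below instead uses a generic SmoothQuant-style rescaling as a stand-in: each input channel of the activations is divided by a scale and the matching weight row is multiplied by it, which leaves the layer output unchanged. The function names, the `alpha` parameter, and the sample tensors are assumptions for illustration only.

```python
def smooth_channels(activations, weights, alpha=0.5):
    """Generic per-input-channel smoothing (SmoothQuant-style stand-in).

    activations: list of rows, each with one value per input channel.
    weights:     list of rows, one per input channel, each with one value
                 per output channel.
    Returns rescaled activations and weights plus the per-channel scales.
    """
    num_in = len(weights)
    scales = []
    for j in range(num_in):
        act_max = max(abs(row[j]) for row in activations)
        w_max = max(abs(w) for w in weights[j])
        if act_max > 0 and w_max > 0:
            # Shift part of the activation range onto the weights.
            scales.append((act_max ** alpha) / (w_max ** (1 - alpha)))
        else:
            scales.append(1.0)
    smoothed_acts = [[row[j] / scales[j] for j in range(num_in)]
                     for row in activations]
    smoothed_weights = [[w * scales[j] for w in weights[j]]
                        for j in range(num_in)]
    return smoothed_acts, smoothed_weights, scales


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical data: input channel 1 carries large outlier activations.
acts = [[0.5, 40.0, -0.2],
        [-0.3, -55.0, 0.1]]
w = [[0.2, -0.1],
     [0.05, 0.03],
     [-0.4, 0.3]]

s_acts, s_w, scales = smooth_channels(acts, w)
out_before = matmul(acts, w)
out_after = matmul(s_acts, s_w)

max_before = max(abs(v) for row in acts for v in row)
max_after = max(abs(v) for row in s_acts for v in row)
print(f"max |activation| before: {max_before:.3f}, after: {max_after:.3f}")
```

Because the rescaling cancels inside the matrix product, the layer computes the same output while the activation range that a low-bit quantizer has to cover shrinks.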