ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality
Mingzhi Zhu, Ding Shang, Sai Qian Zhang
TL;DR
ESCA tackles the challenge of real-time, photorealistic Codec Avatars on resource-constrained VR devices by jointly optimizing low-bit quantization and hardware acceleration. It introduces four tightly integrated components (ICAS, FFAS, UV-Weighted PTQ, and an input-combining PCA accelerator) that preserve facial detail while enabling 4-/8-bit decoding at high throughput. Experiments show FovVideoVDP gains of up to +0.39 over the best 4-bit baselines, a decoder latency reduction of up to 3.36×, and end-to-end rendering at 100 FPS, which supports real-time edge deployment. This full-stack co-design provides a practical pathway to immersive, portable VR experiences with high-fidelity avatar rendering.
Abstract
Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results show that deploying high-fidelity codec avatars on resource-constrained devices is feasible, opening the door to more immersive and portable VR experiences.
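For readers unfamiliar with post-training quantization, the following is a minimal sketch of generic uniform symmetric per-tensor quantization, the kind of plain low-bit baseline that PTQ methods start from. It is not ESCA's method: ICAS, FFAS, and UV-Weighted PTQ are not reproduced here, and the function names, the random stand-in weights, and the mean-squared-error metric are all assumptions made for illustration.

```python
import numpy as np

def quantize_symmetric(w, num_bits=4):
    """Generic uniform symmetric per-tensor quantization (illustrative baseline only)."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the signed range.
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to floating point."""
    return q.astype(np.float32) * scale

# Hypothetical stand-in for a decoder weight tensor.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)

for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    mse = float(np.mean((weights - dequantize(q, scale)) ** 2))
    print(f"{bits}-bit: scale={scale:.5f}, reconstruction MSE={mse:.6f}")
```

Running the sketch on the random tensor shows the 4-bit reconstruction error is much larger than the 8-bit one, which is the quality gap that a tailored PTQ method aims to close.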
