Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering
Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola, Giljoo Nam
TL;DR
GPiCA targets photorealistic head avatars that render efficiently on mobile devices by combining a textured triangle mesh with anisotropic 3D Gaussians in a single differentiable pipeline. An encoder-decoder network predicts a mesh, a texture, and a set of Gaussians from a latent code, and a two-pass hybrid renderer blends front and back Gaussians with the mesh in UV space. Empirically, GPiCA reproduces hair and other volumetric detail with substantially fewer Gaussians than pure Gaussian Splatting baselines while keeping mesh-like rendering efficiency, enabling near real-time performance on Quest-class hardware. Jointly training a semi-transparent mesh and a sparse set of Gaussians thus yields high-fidelity avatars suitable for mobile AR/VR applications.
Abstract
We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
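To make the idea of treating the mesh as a semi-transparent layer within volumetric rendering more concrete, the following is a minimal per-pixel sketch of front-to-back alpha compositing in which the mesh fragment is simply one more layer among the Gaussian splats. It is an illustration under stated assumptions, not the renderer described here: the function name `composite_pixel`, the per-splat `(depth, color, alpha)` tuples, and the single-fragment mesh model are all hypothetical, and projection, sorting on the GPU, and differentiability are left out.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]


def composite_pixel(
    gaussians: List[Tuple[float, RGB, float]],  # hypothetical (depth, color, alpha) per splat at this pixel
    mesh_depth: float,
    mesh_rgb: RGB,
    mesh_alpha: float,
    background: RGB = (0.0, 0.0, 0.0),
) -> RGB:
    """Front-to-back alpha compositing with the mesh as one semi-transparent layer."""
    # Assumption: the mesh contributes a single RGBA fragment at this pixel.
    layers = list(gaussians)
    layers.append((mesh_depth, mesh_rgb, mesh_alpha))
    # Nearest layer first, so splats in front of the mesh are blended before it
    # and splats behind it are attenuated by its alpha.
    layers.sort(key=lambda layer: layer[0])

    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for _, color, alpha in layers:
        weight = transmittance * alpha
        for i in range(3):
            out[i] += weight * color[i]
        transmittance *= 1.0 - alpha

    # Whatever light remains reaches the background.
    for i in range(3):
        out[i] += transmittance * background[i]
    return (out[0], out[1], out[2])
```

In this sketch, a fully opaque mesh texel (alpha of 1) hides every Gaussian behind it, while a transparent texel (alpha of 0) lets the Gaussians alone determine the pixel, which is one way to picture how skin can be left to the mesh and hair to the Gaussians.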
