Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering

Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola, Giljoo Nam

TL;DR

GPiCA targets photorealistic head avatars for mobile rendering by combining a textured triangle mesh with anisotropic 3D Gaussians in a unified, differentiable pipeline. The networks decode a latent expression code into a mesh, an RGBA texture, and a set of Gaussians; a two-pass hybrid renderer then rasterizes the mesh and composites it with the Gaussians in front of and behind it. Empirically, GPiCA reproduces hair and other volumetric detail with substantially fewer Gaussians than pure Gaussian-splatting baselines while keeping mesh-like rendering efficiency, enabling near real-time performance on Quest-class hardware. Jointly training a semi-transparent mesh and a sparse set of Gaussians yields high-fidelity avatars suitable for mobile AR/VR applications.

Abstract

We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.

Paper Structure

This paper contains 27 sections, 9 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview. Given an expression latent code and a view direction, our model outputs 3D Gaussians and a mesh in the UV texture space. The Gaussian positions are predicted relative to the mesh. A sampling mask selects 3D Gaussians for key regions such as hair and beard. In the first pass, the mesh is rasterized to produce RGB texture, opacity, and depth. In the second pass, the 3D Gaussians and the rasterized texture, depth, and opacity are sent to our differentiable hybrid renderer. The renderer first accumulates the colors of Gaussians in front of the mesh, then accumulates the mesh component, and finally accumulates the Gaussians behind the mesh. The system is trained jointly.
  • Figure 2: Learned mesh and Gaussian primitives. Top row: Per-vertex normals of the learned mesh. Middle row: Learned Gaussians as 3D ellipses. Bottom row: Final renders.
  • Figure 3: Comparing with vanilla Gaussian avatars. With only 16k Gaussians, vanilla Gaussian avatars struggle to capture facial details. In contrast, our hybrid avatars, also using 16k 3D Gaussians, achieve significantly sharper results. They are comparable to vanilla Gaussian avatars with 65k 3D Gaussians, which are much slower to render.
  • Figure 4: Mesh and Gaussian opacity contribution. First row: Mesh is colored green and Gaussians are colored blue. Second row: RGB contribution from the mesh. Third row: RGB contribution from the Gaussians. Fourth row: Final renders.
  • Figure 5: Comparing with mesh-based avatars. PiCA can only capture flat surfaces and struggles with complex areas like hair, where it coarsely approximates volumetric details. Our Hybrid GS avatars overcome this limitation by adding a small number of 3D Gaussians.
  • ...and 2 more figures
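
The accumulation order described in the Figure 1 caption (Gaussians in front of the mesh, then the mesh, then Gaussians behind it) can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the paper's renderer: it uses standard front-to-back alpha compositing, assumes each Gaussian has already been reduced to a per-pixel depth, opacity, and color, and the names `composite_pixel` and `GaussianSample` are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]
# Hypothetical per-pixel sample of one Gaussian: (depth, alpha, rgb).
GaussianSample = Tuple[float, float, Color]


def composite_pixel(
    gaussians: List[GaussianSample],
    mesh_depth: float,
    mesh_alpha: float,
    mesh_rgb: Color,
) -> Tuple[Color, float]:
    """Blend Gaussians and one semi-transparent mesh layer, front to back.

    Returns the accumulated color and the remaining transmittance.
    """
    ordered = sorted(gaussians, key=lambda g: g[0])  # nearest first
    front = [g for g in ordered if g[0] < mesh_depth]
    behind = [g for g in ordered if g[0] >= mesh_depth]

    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed

    def accumulate(alpha: float, rgb: Color) -> None:
        nonlocal transmittance
        weight = transmittance * alpha
        for i in range(3):
            color[i] += weight * rgb[i]
        transmittance *= 1.0 - alpha

    # Stage 1: Gaussians in front of the rasterized mesh surface.
    for _, alpha, rgb in front:
        accumulate(alpha, rgb)
    # Stage 2: the mesh sample (rasterized texture color and opacity).
    accumulate(mesh_alpha, mesh_rgb)
    # Stage 3: Gaussians behind the mesh.
    for _, alpha, rgb in behind:
        accumulate(alpha, rgb)

    return (color[0], color[1], color[2]), transmittance
```

In this sketch, a fully opaque mesh sample drives the transmittance to zero, so Gaussians behind it contribute nothing; a partially transparent mesh lets them show through in proportion to the remaining transmittance.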