OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar

Jianqiang Ren, Lin Liu, Steven Hoi

TL;DR

OMG-Avatar is a one-shot method that uses a Multi-LOD (Level-of-Detail) Gaussian representation to reconstruct an animatable 3D head from a single image in 0.2s. It introduces a coarse-to-fine learning paradigm that supports Level-of-Detail functionality and enhances the perception of hierarchical details.

Abstract

We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
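
The abstract describes projection-based sampling of local features and a fusion step guided by a depth buffer. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a pinhole camera, bilinear sampling, and a hard visibility test in which points occluded according to the depth buffer fall back to their global feature. The function names, the `eps` tolerance, and the hard gating rule are all hypothetical choices made for illustration.

```python
import numpy as np

def project_points(points_cam, focal, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to pixel coordinates."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = focal * x / z + cx
    v = focal * y / z + cy
    return np.stack([u, v], axis=1), z

def bilinear_sample(feature_map, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixel coordinates."""
    h, w, _ = feature_map.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feature_map[v0, u0] * (1 - du) + feature_map[v0, u1] * du
    bottom = feature_map[v1, u0] * (1 - du) + feature_map[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def fuse_features(points_cam, global_feats, feature_map, depth_buffer,
                  focal, cx, cy, eps=1e-2):
    """Use the sampled local feature where a point is visible, else the global one.

    Assumption: visibility is a hard test of point depth against the depth
    buffer at the nearest pixel; a learned or soft fusion is equally plausible.
    """
    uv, z = project_points(points_cam, focal, cx, cy)
    local_feats = bilinear_sample(feature_map, uv)
    h, w = depth_buffer.shape
    ui = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    vi = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    visible = z <= depth_buffer[vi, ui] + eps
    fused = np.where(visible[:, None], local_feats, global_feats)
    return fused, visible
```

In this sketch, a point that lies behind the rendered surface never receives image features from the pixel it projects to, which is one simple way to keep the fusion occlusion-plausible.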

Paper Structure

This paper contains 22 sections, 13 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: The overall pipeline of the OMG-Avatar framework. Our method extracts global features via cross-attention and local details via projection-based sampling, which are fused under the guidance of depth buffers. A coarse-to-fine strategy is proposed to facilitate hierarchical detail perception. The head and shoulders are predicted separately using shared features and then combined for rendering.
  • Figure 2: Since the original FLAME model lacks shoulder regions and sufficient geometric representation for high-resolution LOD requirements, we progressively subdivide the head mesh during training and enhance it with predicted shoulder geometry through cross-region combination. Sub #$i$ indicates that the head mesh has been subdivided $i$ times.
  • Figure 3: Cross-identity reenactment results on VFHQ and HDTF datasets.
  • Figure 4: Our local-global feature fusion (OAFF) and multi-region fusion strategy significantly improve identity consistency and completeness in non-head regions. The neural refiner further boosts visual fidelity, especially for dynamic facial expressions.
  • Figure 5: The correlation between Gaussian count and reconstruction performance on the VFHQ dataset.
  • ...and 7 more figures
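
The caption of Figure 2 mentions subdividing the head mesh $i$ times to reach higher LOD resolutions. The sketch below shows how repeated subdivision grows a triangle mesh. It assumes plain midpoint (1-to-4) subdivision; the source does not state which scheme the paper uses, and the helper names `subdivide` and `lod_levels` are hypothetical.

```python
def subdivide(vertices, faces):
    """One round of midpoint subdivision: each triangle is split into four."""
    vertices = list(vertices)
    midpoint_index = {}  # maps an undirected edge to its midpoint vertex index

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_index:
            va, vb = vertices[a], vertices[b]
            vertices.append(tuple((p + q) / 2 for p, q in zip(va, vb)))
            midpoint_index[key] = len(vertices) - 1
        return midpoint_index[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return vertices, new_faces

def lod_levels(vertices, faces, num_levels):
    """Return [(vertices, faces)] for the base mesh and each subdivision level."""
    levels = [(list(vertices), list(faces))]
    for _ in range(num_levels):
        levels.append(subdivide(*levels[-1]))
    return levels

# Toy base mesh (a tetrahedron), standing in for the head mesh.
tetra_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tetra_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
for i, (v, f) in enumerate(lod_levels(tetra_vertices, tetra_faces, 2)):
    print(f"Sub #{i}: {len(v)} vertices, {len(f)} faces")
```

Under this scheme the face count quadruples at every level and each edge contributes one new vertex, so the number of mesh elements available at finer levels grows quickly with each subdivision.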