Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance

Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stefanos Zafeiriou

TL;DR

Arc2Avatar is a score distillation sampling (SDS) method that generates realistic, identity-preserving 3D head avatars from a single image, using Arc2Face identity priors to guide 3D Gaussian Splatting (3DGS) aligned to a FLAME mesh. It extends Arc2Face to diverse-view generation through LoRA fine-tuning on synthetic multi-view data, enabling 360° head synthesis with strong identity retention. A masked 3DGS optimization preserves correspondence with the facial template, which enables blendshape-driven expressions; an optional SDS refinement step corrects extreme expressions. Experiments show state-of-the-art realism and identity preservation across views, supported by favorable FID scores and user-study results. The paper also discusses limitations and ethical considerations of realistic avatar synthesis.

Abstract

Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in reconstructing detailed 3D scenes within multi-view setups and the emergence of large 2D human foundation models, we introduce Arc2Avatar, the first SDS-based method utilizing a human face foundation model as guidance with just a single image as input. To achieve that, we extend such a model for diverse-view human head generation by fine-tuning on synthetic data and modifying its conditioning. Our avatars maintain a dense correspondence with a human face mesh template, allowing blendshape-based expression generation. This is achieved through a modified 3DGS approach, connectivity regularizers, and a strategic initialization tailored for our task. Additionally, we propose an optional efficient SDS-based correction step to refine the blendshape expressions, enhancing realism and diversity. Experiments demonstrate that Arc2Avatar achieves state-of-the-art realism and identity preservation, effectively addressing color issues by allowing the use of very low guidance, enabled by our strong identity prior and initialization strategy, without compromising detail. Please visit https://arc2avatar.github.io for more resources.
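
For readers unfamiliar with SDS, the block below restates the generic score distillation gradient in its commonly cited form, together with classifier-free guidance. It is background only and is not necessarily the exact objective used in the paper. The symbols are assumptions introduced here: \theta denotes the splat parameters, g the differentiable renderer, y the conditioning (identity and view), s the guidance scale, and w(t) a timestep weighting.

```latex
% Rendered image (or its latent) and its noised version at timestep t
x = g(\theta), \qquad
x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad
\epsilon \sim \mathcal{N}(0, I)

% Classifier-free guidance with scale s
\hat{\epsilon}_\phi(x_t; y, t) =
  \epsilon_\phi(x_t; \varnothing, t)
  + s \left( \epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t) \right)

% Generic SDS gradient with respect to the 3D parameters
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} =
  \mathbb{E}_{t, \epsilon} \left[
    w(t) \left( \hat{\epsilon}_\phi(x_t; y, t) - \epsilon \right)
    \frac{\partial x}{\partial \theta}
  \right]
```

In this notation, the "very low guidance" mentioned in the abstract corresponds to a small value of s.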

Paper Structure (36 sections, 8 equations, 15 figures, 2 tables)


Figures (15)

  • Figure 1: Arc2Avatar creates detailed 3D head avatars from a single image with unprecedented realism and identity similarity through a carefully designed score distillation sampling approach on top of an adapted 2D face foundation model. Our method supports blendshape-driven expression generation by retaining dense correspondence between the 3D Gaussian Splats and an underlying facial mesh template.
  • Figure 2: Overview of the proposed 3D generation framework. Our method uses an adapted Arc2Face diffusion model [paraperas2024arc2face], augmented for diverse view generation through fine-tuning on PanoHead [PanoHead] samples. For 3D generation, starting with a frontal image, we extract the Arc2Face embedding and initialize Gaussian Splats on each vertex of the FLAME head model [FLAME:SiggraphAsia2017], fitting the facial area to the mean facial texture. We then apply an SDS alternative, where each iteration combines the Arc2Face embedding with a CLIP-encoded view embedding to denoise the renderings and update the splats. Initially, only facial splats are optimized for a set number of iterations. Subsequently, all splats are refined with densification, pruning, and opacity resets disabled for the facial area. Dense mesh correspondence is maintained through targeted initialization, avoidance of the standard 3DGS modifications in the facial region, and mesh regularizers adhering to the underlying template. Therefore, our method enables straightforward avatar expressions via blendshapes and shows potential for expression refinement after blendshape application using the same framework with minimal steps.
  • Figure 2: Preference user study results.
  • Figure 3: LoRA [hu2022lora] fine-tuning of Arc2Face [paraperas2024arc2face] with PanoHead [PanoHead] samples. The generation is conditioned on the frontal sample.
  • Figure 4: Qualitative comparison of competing methods for celebrities. ID-methods use an average ID-embedding from multiple images, although they can also use that of a single image.
  • ...and 10 more figures
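
The two-stage schedule described in the Figure 2 caption (facial splats first, then all splats, with densification, pruning, and opacity resets disabled in the facial area) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names, the per-splat boolean face mask, and the linear blendshape form are assumptions made for illustration, and the sketch assumes every splat has a blendshape offset (zero for splats that do not sit on a template vertex).

```python
from typing import List, Sequence

Vec3 = List[float]


def update_mask(iteration: int, face_mask: Sequence[bool],
                face_only_iters: int) -> List[bool]:
    """Which splats are optimized at this iteration (hypothetical helper)."""
    if iteration < face_only_iters:
        # Stage 1: only splats in the facial area are updated.
        return list(face_mask)
    # Stage 2: every splat is updated.
    return [True] * len(face_mask)


def densify_mask(face_mask: Sequence[bool]) -> List[bool]:
    """Which splats may be densified, pruned, or opacity-reset.

    Facial splats are excluded so each one stays tied to its template vertex.
    """
    return [not is_face for is_face in face_mask]


def apply_blendshapes(positions: Sequence[Vec3],
                      blendshapes: Sequence[Sequence[Vec3]],
                      weights: Sequence[float]) -> List[Vec3]:
    """Move splat centers by a weighted sum of per-splat offsets.

    positions:   N centers, one per splat
    blendshapes: K sets of N offsets (assumed linear blendshape model)
    weights:     K expression coefficients
    """
    posed = [list(p) for p in positions]
    for weight, offsets in zip(weights, blendshapes):
        for i, offset in enumerate(offsets):
            for axis in range(3):
                posed[i][axis] += weight * offset[axis]
    return posed


# Usage: four splats, the first two belong to the facial area.
face_mask = [True, True, False, False]
positions = [[0.0, 0.0, 0.0] for _ in range(4)]
jaw_open = [[0.0, 0.1, 0.0]] + [[0.0, 0.0, 0.0] for _ in range(3)]
posed = apply_blendshapes(positions, [jaw_open], [0.5])
```

Because the facial splats are never split, cloned, or pruned in this sketch, their indices keep matching the template vertices, which is what lets a blendshape offset defined on the mesh be applied directly to the splat centers.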