PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Antonio Oroz, Matthias Nießner, Tobias Kirschstein

TL;DR

PercHead addresses single-image 3D head reconstruction and editing with a unified base model: a dual-branch encoder followed by a ViT-based 3D decoder that lifts 2D features into a 3D canonical space through iterative cross-attention. Rendering uses Gaussian Splatting, while perceptual supervision from DINOv2 and SAM2.1 provides robust, generalizable signals for geometry and appearance without relying on traditional reconstruction losses. The editing variant swaps the encoder to condition on a segmentation map for geometry and a CLIP-encoded style input (reference image or text prompt) for appearance, enabling disentangled 3D editing through an interactive GUI. Quantitative and qualitative results show state-of-the-art novel-view synthesis, robustness to extreme viewing angles, strong identity preservation, and flexible zero-shot text-guided editing, with applications in avatars, telepresence, and AR/VR workflows.

Abstract

We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing, two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and exhibits exceptional robustness to extreme viewing angles compared to established baselines. In addition, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts.

Project Page: https://antoniooroz.github.io/PercHead
Video: https://www.youtube.com/watch?v=4hFybgTk4kE
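
The abstract describes a decoder that lifts 2D features into 3D space through iterative cross-attention. The following is a minimal sketch of that general idea, not the authors' implementation: a fixed set of 3D tokens repeatedly queries the 2D feature tokens and accumulates what it attends to. The function names (`cross_attention`, `lift_to_3d`), the random token initialization, the residual update, and the absence of learned projection matrices are all simplifying assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """One cross-attention step: every query token attends over all 2D features.

    Assumption: keys and values are the raw features (no learned projections).
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        weights = softmax(scores)
        attended = [
            sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
        ]
        # Residual update: the token keeps its state and adds the attended features.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def lift_to_3d(features_2d, num_tokens_3d, num_layers=3, seed=0):
    """Hypothetical lifting loop: 3D tokens query the 2D features num_layers times."""
    rng = random.Random(seed)
    dim = len(features_2d[0])
    # Assumption: 3D tokens start from small random values (learned in a real model).
    tokens = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens_3d)
    ]
    for _ in range(num_layers):
        tokens = cross_attention(tokens, features_2d)
    return tokens
```

In a full model each 3D token would then be decoded into renderable primitives (here, Gaussian Splatting parameters); that step is omitted because the abstract does not specify it.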

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: PercHead. Our method reconstructs high-fidelity 3D heads from single input images, maintaining consistency across arbitrary viewpoints. Beyond reconstruction, our fine-tuned editing model enables realistic 3D head generation from a segmentation map as geometric input, with style controlled via a reference image or text prompt.
  • Figure 2: Overview of Our Method. Our framework supports 3D Reconstruction from a single image and 3D Editing from a segmentation map and style input. Both tasks share a 3D ViT decoder that lifts 2D features via iterative cross-attention, differing only in the encoder. The reconstruction model uses a dual-branch encoder with DINOv2 and a task-specific ViT; the editing model uses a segmentation ViT and injects a global CLIP style token. Outputs are rendered via Gaussian Splatting and refined with a 2D CNN, with supervision from DINOv2 and SAM2.1.
  • Figure 3: Qualitative Evaluation on Samples From Ava-256 and NeRSemble.
  • Figure 4: 3D Reconstructions Across Video Frames. Our model maintains consistent geometry and appearance across time, enabling coherent 3D avatar lifting while capturing subtle expression changes like mouth, eye, and eyelid movements.
  • Figure 5: Text-Based 3D Editing. Given a fixed segmentation map and varying text prompts, our model generates diverse 3D heads with consistent geometry. Styles are guided by text, enabling low-level (e.g., hair color) and high-level (e.g., age) edits. Despite no text-specific training, our model achieves zero-shot editing via the vision-aligned CLIP text encoder.
  • ...and 4 more figures
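
The Figure 2 caption mentions supervision from DINOv2 and SAM2.1, meaning the rendered image is compared with the target in the feature space of frozen pretrained networks. The sketch below illustrates that general pattern only. The two extractors (`patch_pixels`, `patch_stats`) are toy stand-ins for the real networks, and the cosine distance and per-extractor weights are assumptions; the paper's actual loss formulation is not given in this summary.

```python
import math


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def patch_pixels(image, patch=2):
    """Toy stand-in extractor A: the raw pixels of each patch form one token.

    Assumes image height and width are divisible by `patch`.
    """
    tokens = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tokens.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return tokens


def patch_stats(image, patch=2):
    """Toy stand-in extractor B: [mean, contrast] of each patch."""
    tokens = []
    for pixels in patch_pixels(image, patch):
        tokens.append([sum(pixels) / len(pixels), max(pixels) - min(pixels)])
    return tokens


def perceptual_loss(rendered, target, extractors, weights):
    """Weighted sum of mean per-token feature distances, one term per extractor."""
    total = 0.0
    for extract, weight in zip(extractors, weights):
        feats_rendered = extract(rendered)
        feats_target = extract(target)
        distances = [
            cosine_distance(fr, ft) for fr, ft in zip(feats_rendered, feats_target)
        ]
        total += weight * sum(distances) / len(distances)
    return total
```

The loss is near zero when the rendering matches the target and grows as their features diverge. Using several extractors lets each contribute a different kind of signal, which is the role the caption assigns to the two pretrained models.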