PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Antonio Oroz, Matthias Nießner, Tobias Kirschstein
TL;DR
PercHead tackles single-image 3D head reconstruction and editing with a unified base model: a dual-branch encoder followed by a ViT-based 3D decoder that lifts 2D features into a canonical 3D space. Rendering uses Gaussian Splatting, and perceptual supervision from DINOv2 and SAM2.1 provides robust, generalizable signals for geometry and appearance without relying on traditional reconstruction losses. The editing variant swaps the encoder to condition on segmentation maps for geometry and on CLIP-informed style inputs (image or text) for appearance, enabling disentangled 3D editing through an intuitive GUI. Quantitative and qualitative results show state-of-the-art performance in novel-view and extreme-view reconstruction, strong identity preservation, and flexible, zero-shot text-guided editing, with potential applications in avatars, telepresence, and AR/VR workflows.
Abstract
We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing, two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and is markedly more robust to extreme viewing angles than established baselines. This base model can also be extended to semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We demonstrate the model's 3D editing capabilities through a lightweight, interactive GUI, in which users sculpt geometry by drawing segmentation maps and stylize appearance via natural-language or image prompts.
Project Page: https://antoniooroz.github.io/PercHead
Video: https://www.youtube.com/watch?v=4hFybgTk4kE
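To make the phrase "lifts 2D features into 3D space through iterative cross-attention" concrete, the following is a minimal sketch of the general mechanism, not the authors' implementation. It assumes a set of 3D query tokens that repeatedly attend over 2D image feature tokens with a residual update. The function names (`cross_attention`, `lift`), the token counts, the feature dimension, the number of layers, and the use of identity projections in place of learned ones are all hypothetical choices made for illustration; how the resulting tokens are decoded into Gaussian parameters is not covered here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """One cross-attention step: every query attends over all feature tokens.

    Assumption: learned query/key/value projections are replaced by the
    identity to keep the sketch short.
    Returns the updated queries and the per-query attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and each 2D token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        attn = softmax(scores)
        # Attention-weighted average of the 2D feature tokens.
        context = [sum(a * f[j] for a, f in zip(attn, features)) for j in range(dim)]
        # Residual update of the query token.
        updated.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(attn)
    return updated, weights


def lift(queries, features, num_layers=3):
    """Apply cross-attention repeatedly (the 'iterative' part)."""
    weights = []
    for _ in range(num_layers):
        queries, weights = cross_attention(queries, features)
    return queries, weights


# Hypothetical sizes: 16 image patch tokens, 4 query tokens, 8 channels.
rng = random.Random(0)
DIM, NUM_QUERIES, NUM_PATCHES = 8, 4, 16
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
query_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

lifted, last_weights = lift(query_tokens, image_tokens)
print(len(lifted), len(lifted[0]))  # prints "4 8"
```

The number of output tokens is set by the queries rather than by the image resolution, which is one reason query-based cross-attention is a natural fit for producing a fixed-size 3D representation from 2D features.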
