Reference-Based 3D-Aware Image Editing with Triplanes
Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, Aysegul Dundar
TL;DR
The paper introduces a reference-based, 3D-aware image editing framework built on triplane representations (EG3D) to enable faithful, view-consistent edits from a single image. It localizes editing regions in triplane space via gradient-propagated segmentation masks, then achieves seamless fusion through an implicit fusion step powered by a fine-tuned encoder and a canonical-pose rendering, blending with $T_{\text{f}} = \text{E}(M_{\text{ref}}) \cdot T_{\text{ref}} + \text{E}(M_{\text{src}}) \cdot T_{\text{src}} + (\text{E}(M_{\text{src}}) - \text{E}(M_{\text{ref}})) \cdot T_{\text{imp}}$. A targeted encoder fine-tuning stage uses masked, multi-view losses (LPIPS and ArcFace identity) to improve boundary realism and reduce background leakage. The approach demonstrates state-of-the-art performance on multiple domains, including faces, 360-degree heads, animals, cartoon stylizations, and clothing edits, while enabling cross-generator and class-agnostic edits. This work provides a practical, flexible solution for high-quality, 3D-consistent reference-based editing with broad potential in creative and production pipelines.
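As a rough illustration of the blending expression in the summary above, the following sketch evaluates it element-wise exactly as written. It is not the authors' implementation. The function name `fuse_triplanes` and its argument names are hypothetical. The operator E is not defined in this summary, so the sketch takes E(M_ref) and E(M_src) as precomputed per-element weights. Triplane features are flattened to plain Python lists to keep the example dependency-free.

```python
from typing import List, Sequence


def fuse_triplanes(
    t_ref: Sequence[float],
    t_src: Sequence[float],
    t_imp: Sequence[float],
    e_ref: Sequence[float],
    e_src: Sequence[float],
) -> List[float]:
    """Evaluate the blending expression from the summary, element by element.

    t_ref, t_src, t_imp: flattened reference, source, and implicit triplane
        features (hypothetical layout, assumed for this sketch).
    e_ref, e_src: per-element weights standing in for E(M_ref) and E(M_src);
        how E is computed is not stated in the summary.
    """
    lengths = {len(t_ref), len(t_src), len(t_imp), len(e_ref), len(e_src)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    # T_f = E(M_ref) * T_ref + E(M_src) * T_src + (E(M_src) - E(M_ref)) * T_imp
    return [
        er * tr + es * ts + (es - er) * ti
        for tr, ts, ti, er, es in zip(t_ref, t_src, t_imp, e_ref, e_src)
    ]


if __name__ == "__main__":
    # Toy values chosen only to exercise the formula, not taken from the paper.
    fused = fuse_triplanes(
        t_ref=[2.0], t_src=[4.0], t_imp=[100.0], e_ref=[0.5], e_src=[0.5]
    )
    print(fused)
```

Under this reading, the implicit term drops out wherever the two weights are equal, so the toy call reduces to a plain weighted sum of the reference and source features.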
Abstract
Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. We demonstrate how our approach excels across diverse domains, including human faces, 360-degree heads, animal faces, partially stylized edits like cartoon faces, full-body clothing edits, and edits on class-agnostic samples. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.
