Reference-Based 3D-Aware Image Editing with Triplanes

Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, Aysegul Dundar

TL;DR

The paper introduces a reference-based, 3D-aware image editing framework built on triplane representations (EG3D) to enable faithful, view-consistent edits from a single image. It localizes editing regions in triplane space via gradient-propagated segmentation masks, then achieves seamless fusion through an implicit fusion step powered by a fine-tuned encoder and a canonical-pose rendering, blending with $T_{\text{f}} = \text{E}(M_{\text{ref}}) \cdot T_{\text{ref}} + \text{E}(M_{\text{src}}) \cdot T_{\text{src}} + (\text{E}(M_{\text{src}}) - \text{E}(M_{\text{ref}})) \cdot T_{\text{imp}}$. A targeted encoder fine-tuning stage uses masked, multi-view losses (LPIPS and ArcFace identity) to improve boundary realism and reduce background leakage. The approach demonstrates state-of-the-art performance on multiple domains, including faces, 360-degree heads, animals, cartoon stylizations, and clothing edits, while enabling cross-generator and class-agnostic edits. This work provides a practical, flexible solution for high-quality, 3D-consistent reference-based editing with broad potential in creative and production pipelines.
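To make the blending formula in the TL;DR concrete, here is a minimal element-wise sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_triplanes`, the flat-list layout, and the toy values are all hypothetical. The sketch takes the already-processed masks $\text{E}(M_{\text{ref}})$ and $\text{E}(M_{\text{src}})$ as inputs and makes no assumption about what the operator $\text{E}$ does.

```python
from typing import List, Sequence


def fuse_triplanes(
    t_ref: Sequence[float],
    t_src: Sequence[float],
    t_imp: Sequence[float],
    em_ref: Sequence[float],
    em_src: Sequence[float],
) -> List[float]:
    """Blend triplane features element-wise, following the TL;DR formula:

        T_f = E(M_ref) * T_ref + E(M_src) * T_src
              + (E(M_src) - E(M_ref)) * T_imp

    Assumption: each argument is a flattened triplane (or mask) of equal
    length; em_ref and em_src are the masks after E has been applied.
    """
    lengths = {len(t_ref), len(t_src), len(t_imp), len(em_ref), len(em_src)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of elements")
    return [
        mr * tr + ms * ts + (ms - mr) * ti
        for tr, ts, ti, mr, ms in zip(t_ref, t_src, t_imp, em_ref, em_src)
    ]


# Toy inputs (hypothetical values, three feature elements each).
t_ref = [1.0, 1.0, 1.0]   # reference triplane features
t_src = [2.0, 2.0, 2.0]   # source triplane features
t_imp = [4.0, 4.0, 4.0]   # implicit-fusion triplane features
em_ref = [1.0, 0.0, 0.0]  # processed reference mask, E(M_ref)
em_src = [0.0, 1.0, 0.5]  # processed source mask, E(M_src)

fused = fuse_triplanes(t_ref, t_src, t_imp, em_ref, em_src)
print(fused)
```

In practice the triplanes would be multi-channel tensors and the masks would be broadcast over the feature channels; the flat lists here only keep the sketch dependency-free.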

Abstract

Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. We demonstrate how our approach excels across diverse domains, including human faces, 360-degree heads, animal faces, partially stylized edits like cartoon faces, full-body clothing edits, and edits on class-agnostic samples. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.

Paper Structure (11 sections, 6 equations, 23 figures, 3 tables, 1 algorithm)

This paper contains 11 sections, 6 equations, 23 figures, 3 tables, 1 algorithm.

Figures (23)

  • Figure 1: Our approach excels in reference-based edits, faithfully reproducing the copied reference parts with a single source and reference image. Leveraging 3D-aware triplanes, our edits are versatile and 3D consistent, allowing for rendering from various viewpoints. We show results on human faces, heads, and bodies, extending beyond to animal faces and class-agnostic samples.
  • Figure 2: Current methods struggle with 3D consistency [barbershop, hairclipv2, paintbyexample], faithfulness to the reference [kafri2021stylefusion, infedit, ledits++], and visual artifacts [infedit, paintbyexample]. Our method provides 3D-consistent, reference-based edits from single images, independent of camera poses. N/A indicates the model is incapable of such edits.
  • Figure 3: Triplane part localization stage, where $\mathbf{E}$, $\mathbf{G}$, and $\mathcal{R}$ are the encoder [yuan2023make, bhattarai2024triplanenet], generator [chan2022efficient], and neural volumetric renderer, respectively. For the 2D segmentation model $\mathcal{S}_\text{2D}$, we use state-of-the-art off-the-shelf segmentation models [faceparsingpytorch]. Images other than the input image are zoomed in for visualization purposes.
  • Figure 4: Triplane localization and implicit fusion stages, where $\mathbf{E}^*$ denotes the fine-tuned image encoder described in the encoder fine-tuning section. Straightforward stitching in the triplane results in color inconsistency across the boundaries, as shown in $I_{tmp}$ (zoom in for details). Leveraging $\mathbf{E}^*$, we aim to attain seamless boundaries and produce outputs with a natural appearance.
  • Figure 5: Pipeline for the implicit fusion encoder fine-tuning. We generate masked ground truths for our task by utilizing 2D segmentation networks and renderings from multiple views. We aim to carry the reference parts in great detail to our source image while preserving the source's identity. Triplane fusion corresponds to the final fusion equation.
  • ...and 18 more figures