
3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing

Haoran Li, Long Ma, Haolin Shi, Yanbin Hao, Yong Liao, Lechao Cheng, Pengyuan Zhou

TL;DR

3D-GOI tackles the problem of editing complex multi-object scenes by enabling multifaceted affine edits across multiple objects through a three-stage inversion pipeline built on GIRAFFE. It segments scenes into objects and background, employs eight dedicated encoders for coarse code prediction, and refines all codes with a novel round-robin optimization strategy, enabling precise reconstruction of $5n+3$ codes. Empirical results on synthetic datasets (e.g., G-CompCars, Clevr) show superior reconstruction and editing capabilities for single- and multi-object scenes, with novel-view synthesis for faces and robust performance against segmentation errors. The work demonstrates substantial potential for flexible 3D editing in VR/AR and the Metaverse, while noting limitations related to distribution gaps between synthetic generators and real-world data and outlining avenues for future improvements.

Abstract

Current GAN inversion methods can typically edit only the appearance and shape of a single object and the background, while overlooking spatial information. In this work, we propose a 3D editing framework, 3D-GOI, to enable multifaceted editing of affine information (scale, translation, and rotation) on multiple objects. 3D-GOI realizes this complex editing function by inverting the abundance of attribute codes (object shape/appearance/scale/rotation/translation, background shape/appearance, and camera pose) controlled by GIRAFFE, a renowned 3D GAN. Accurately inverting all of these codes is challenging; 3D-GOI addresses the challenge in three main steps. First, we segment the objects and the background in a multi-object image. Second, we use a custom Neural Inversion Encoder to obtain coarse codes for each object. Finally, we use a round-robin optimization algorithm to obtain precise codes that reconstruct the image. To the best of our knowledge, 3D-GOI is the first framework to enable multifaceted editing on multiple objects. Both qualitative and quantitative experiments demonstrate that 3D-GOI holds immense potential for flexible, multifaceted editing in complex multi-object scenes. Our project and code are released at https://3d-goi.github.io.
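
The $5n+3$ count quoted in the TL;DR follows from the attribute list in the abstract: five codes per object (shape, appearance, scale, rotation, translation) plus three scene-level codes (background shape, background appearance, camera pose). The short Python sketch below only illustrates that bookkeeping. The names `PER_OBJECT_CODES`, `SCENE_CODES`, `code_names`, and `code_count` are hypothetical and are not taken from the authors' code.

```python
# Attribute names come from the abstract; grouping them into two tuples is
# an illustrative assumption, not the authors' data structure.
PER_OBJECT_CODES = ("shape", "appearance", "scale", "rotation", "translation")
SCENE_CODES = ("bg_shape", "bg_appearance", "camera_pose")


def code_names(n_objects: int) -> list[str]:
    """List every code that must be inverted for a scene with n_objects objects."""
    if n_objects < 0:
        raise ValueError("n_objects must be non-negative")
    # Five codes for each object, labelled by object index.
    names = [
        f"obj{i}_{attr}"
        for i in range(n_objects)
        for attr in PER_OBJECT_CODES
    ]
    # Three codes shared by the whole scene.
    names.extend(SCENE_CODES)
    return names


def code_count(n_objects: int) -> int:
    """Number of codes to invert: 5 * n_objects + 3."""
    return len(code_names(n_objects))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, code_count(n))
```

For a single object this gives eight codes, which is consistent with the eight dedicated encoders mentioned in the TL;DR.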

Paper Structure

This paper contains 30 sections, 17 equations, 18 figures, 7 tables, 1 algorithm.

Figures (18)

  • Figure 1: The first row shows the editing results of traditional 2D/3D GAN inversion methods on multi-object images. The second row showcases 3D-GOI, which can perform multifaceted editing on complex images with multiple objects. 'bg' stands for background. The red crosses in the upper right figures indicate features that cannot be edited with current 2D/3D GAN inversion methods.
  • Figure 2: Different GANs and GAN inversion methods utilize codes differently. $\omega$ represents the latent code and $c$ represents the camera pose.
  • Figure 3: The overall framework of 3D-GOI. As shown in the upper half, the encoders are trained on single-object scenes, each time using $L_{enc}$ to predict one code $w \in W$, while the other codes use real values. The lower half depicts the inversion process for a multi-object scene. We first decompose the objects and background from the scene, then use the trained encoders to extract coarse codes, and finally use the round-robin optimization algorithm to obtain precise codes. The green blocks indicate required training and the yellow blocks indicate fixed parameters.
  • Figure 4: Scene decomposition. (a) The input image. (b) The feature weight map of car A, where the redder regions indicate a higher opacity and the bluer regions lower opacity. (c) The feature weight map of car B. (d) The feature weight map of the background. By integrating these maps, it becomes apparent that the region corresponding to car A predominantly consists of the feature representation of cars A and B. The background's visible area solely contains the background's feature representation.
  • Figure 5: Neural Inversion Encoder. (a) The Neural Rendering Block in GIRAFFE (Niemeyer & Geiger, 2021), an upsampling process that generates the image $\hat{I}$. (b) The Neural Inversion Encoder, which reverses (a) as a downsampling process. $I$ is the input image, and $H,W$ are the image height and width. $I_v$ is the heatmap of the image, $H_v,W_v$ and $M_f$ are the dimensions of $I_v$, $w$ is the code to be predicted, and $w_f$ is the dimension of $w$. Up/Down means upsampling/downsampling.
  • ...and 13 more figures
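
The Figure 3 caption describes coarse codes from the encoders being refined by a round-robin optimization, but this section does not spell out the algorithm. The sketch below is therefore only a generic illustration of the round-robin idea, under the assumption that codes are visited one at a time in a fixed cycle while the others are held fixed. It is not the authors' algorithm: the scalar codes, the quadratic `toy_loss`, the finite-difference gradient, and the names `round_robin_refine` and `TARGET` are all hypothetical stand-ins for the real generator and reconstruction loss.

```python
from typing import Callable, Dict

Codes = Dict[str, float]

# Hypothetical "true" codes that the toy loss is minimized at.
TARGET: Codes = {"obj0_scale": 1.5, "obj0_rotation": 0.3, "camera_pose": -0.7}


def toy_loss(codes: Codes) -> float:
    """Stand-in for a reconstruction loss: squared distance to TARGET."""
    return sum((codes[name] - TARGET[name]) ** 2 for name in TARGET)


def round_robin_refine(
    coarse: Codes,
    loss_fn: Callable[[Codes], float],
    rounds: int = 50,
    lr: float = 0.2,
    eps: float = 1e-4,
) -> Codes:
    """Cycle through the codes, updating one at a time with the rest fixed."""
    codes = dict(coarse)  # do not mutate the caller's coarse estimate
    for _ in range(rounds):
        for name in codes:
            base = codes[name]
            # Central finite difference with respect to this code only.
            codes[name] = base + eps
            loss_up = loss_fn(codes)
            codes[name] = base - eps
            loss_down = loss_fn(codes)
            grad = (loss_up - loss_down) / (2 * eps)
            # Gradient step on this code; all other codes are untouched.
            codes[name] = base - lr * grad
    return codes


if __name__ == "__main__":
    coarse = {"obj0_scale": 1.0, "obj0_rotation": 0.0, "camera_pose": 0.0}
    refined = round_robin_refine(coarse, toy_loss)
    print(toy_loss(coarse), toy_loss(refined))
```

In a real setting the loss would compare the generator's rendering against the input image and the codes would be tensors updated by an autograd optimizer; the toy version only shows the one-code-at-a-time update order.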