3D-aware Image Generation and Editing with Multi-modal Conditions

Bo Li, Yi-ke Li, Zhi-fen He, Bin Liu, Yu-Kun Lai

TL;DR

This work explores the latent space of 3D Generative Adversarial Networks (GANs) and proposes a novel disentanglement strategy that separates appearance features from shape features during the generation process. The resulting method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.

Abstract

3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although related works have made great progress in this field, most existing methods disentangle shape and appearance poorly and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model that incorporates multiple types of conditional inputs, including pure noise, text, and a reference image. First, we explore the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. Second, we propose a unified framework for flexible image generation and editing tasks with multi-modal conditions. Our method can generate diverse images from distinct noise inputs, edit attributes through a text description, and perform style transfer given a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.

Paper Structure (16 sections, 11 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Given a 2D semantic map, the proposed method can accomplish 3D-aware, multi-view-consistent generation with multi-modal conditions, such as generating diverse images from distinct noise inputs, editing attributes through a text description, and performing style transfer given a reference RGB image. Note that, for a given condition, the proposed method keeps the generated appearance consistent across different semantic maps, as shown in each column.
  • Figure 2: Each row shows conditional image generation results that use an identical style latent variable for distinct semantic shapes.
  • Figure 3: Model overview. (a) The framework of the proposed 3D-aware image generation model with multi-modal conditions. Given a semantic map, we can use random noise, a text prompt, or a reference image as the conditional input to accomplish 3D-aware, multi-view-consistent image generation. (b) Details of the proposed interactive module.
  • Figure 4: Style mixing results in the latent space of EG3D [chan2021efficient]. Given a source image A (in the first column) and a target image B (in the top row), the mixed images are synthesized by blending the style vectors of A and B at a selected intersection point N.
  • Figure 5: Quantitative evaluations of semantic structure similarity of style mixing in the latent space of EG3D [chan2021efficient].
  • ...and 8 more figures
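
The style mixing described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a latent code is stored as a list of per-layer style vectors and that "blending at intersection point N" means taking the first N layers from A and the remaining layers from B; both the layout and the direction of the swap are assumptions, not details confirmed by the paper. The function name `mix_styles` is hypothetical.

```python
from typing import List, Sequence

# Hypothetical layout: one style vector per generator layer.
StyleCode = List[List[float]]


def mix_styles(
    styles_a: Sequence[Sequence[float]],
    styles_b: Sequence[Sequence[float]],
    n: int,
) -> StyleCode:
    """Return a code using layers [0, n) from A and layers [n, end) from B.

    Assumption: which image supplies the early layers is a choice made
    for this sketch, not something the source specifies.
    """
    if len(styles_a) != len(styles_b):
        raise ValueError("A and B must have the same number of style layers")
    if not 0 <= n <= len(styles_a):
        raise ValueError("n must lie between 0 and the number of layers")
    # Copy each vector so the inputs are left unmodified.
    head = [list(v) for v in styles_a[:n]]
    tail = [list(v) for v in styles_b[n:]]
    return head + tail


if __name__ == "__main__":
    # Toy example: 4 layers, 2-dimensional style vectors.
    a = [[1.0, 1.0] for _ in range(4)]
    b = [[2.0, 2.0] for _ in range(4)]
    print(mix_styles(a, b, 1))
```

Sweeping N from 0 to the number of layers moves the result from a pure copy of B to a pure copy of A, which is one way to probe which layers carry shape information and which carry appearance.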