Attribute2Image: Conditional Image Generation from Visual Attributes

Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee

TL;DR

The paper tackles generating images from visual attributes by introducing a layered, disentangled generative framework that separates foreground and background factors. It presents disCVAE, a two-stream extension of CVAE with a gating mechanism, trained end-to-end to produce attribute-conditioned samples and support reconstruction and completion via optimization-based posterior inference. Experiments on LFW and CUB demonstrate realistic, diverse samples and superior handling of complex textures and shapes, with particular gains for birds when latent-space disentangling is applied. The work offers a principled approach for controllable, interpretable image synthesis and practical post-hoc inference for novel inputs.

Abstract

This paper investigates a novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. We experiment with natural images of faces and birds and demonstrate that the proposed models are capable of generating realistic and diverse samples with disentangled latent representations. We use a general energy minimization algorithm for posterior inference of latent variables given novel images. Therefore, the learned generative models show excellent quantitative and visual results in the tasks of attribute-conditioned image reconstruction and completion.
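The foreground/background composite described above can be illustrated with a small sketch. The abstract does not spell out the blending rule, so the per-pixel form used here, $x = g \odot x_F + (1 - g) \odot x_B$ with a gate $g \in [0, 1]$, is an assumption for illustration, and the function name `composite` is hypothetical.

```python
def composite(foreground, background, gate):
    """Blend two equally sized 2-D images pixel by pixel.

    Assumed rule (not stated in the abstract): x = g * x_F + (1 - g) * x_B,
    where a gate value of 1 selects the foreground and 0 the background.
    """
    blended = []
    for f_row, b_row, g_row in zip(foreground, background, gate):
        blended.append(
            [g * f + (1.0 - g) * b for f, b, g in zip(f_row, b_row, g_row)]
        )
    return blended


# Toy 2x2 grayscale example: white foreground, black background.
fg = [[1.0, 1.0], [1.0, 1.0]]
bg = [[0.0, 0.0], [0.0, 0.0]]
gate = [[1.0, 0.0], [0.5, 0.25]]
image = composite(fg, bg, gate)
```

In a learned model the foreground, background, and gate would each come from a decoder network; plain nested lists stand in for those outputs here.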

Paper Structure

This paper contains 28 sections, 15 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An example that demonstrates the problem of conditioned image generation from visual attributes. We assume a vector of visual attributes is extracted from a natural language description, and then this attribute vector is combined with learned latent factors to generate diverse image samples.
  • Figure 2: Graphical model representations of attribute-conditioned image generation models (a) without (CVAE) and (b) with (disCVAE) disentangled latent space.
  • Figure 3: Attribute-conditioned image generation.
  • Figure 4: Attribute-conditioned image progression. The visualization is organized into six attribute groups (e.g., "gender", "age", "facial expression", "eyewear", "hair color" and "primary color (blue vs. yellow)"). Within each group, the images are generated from $p_\theta(x|y,z)$ with $z \sim \mathcal{N}(0,I)$ and $y = [y_\alpha,y_{rest}]$, where $y_\alpha = (1-\alpha) \cdot y_{min} + \alpha \cdot y_{max}$. Here, $y_{min}$ and $y_{max}$ stand for the minimum and maximum attribute values, respectively, in the dataset along the corresponding dimension.
  • Figure 5: Analysis: Latent Space Disentangling.
  • ...and 2 more figures
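
The attribute progression in the Figure 4 caption amounts to a linear sweep of one attribute dimension between its dataset minimum and maximum. A minimal sketch of how the conditioning vectors $y = [y_\alpha, y_{rest}]$ could be built is shown below; the function names and the example values are hypothetical, and the decoder $p_\theta(x|y,z)$ itself is not shown.

```python
def interpolate_attribute(y, dim, y_min, y_max, alpha):
    """Return a copy of y with entry `dim` set to
    y_alpha = (1 - alpha) * y_min + alpha * y_max."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    y_new = list(y)  # leave the remaining attributes (y_rest) untouched
    y_new[dim] = (1.0 - alpha) * y_min + alpha * y_max
    return y_new


def attribute_progression(y, dim, y_min, y_max, steps):
    """Build `steps` attribute vectors sweeping one dimension from y_min to y_max."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        interpolate_attribute(y, dim, y_min, y_max, i / (steps - 1))
        for i in range(steps)
    ]


# Hypothetical 3-dimensional attribute vector; sweep dimension 1 over [-1, 1].
base = [0.2, -0.5, 0.7]
progression = attribute_progression(base, dim=1, y_min=-1.0, y_max=1.0, steps=5)
```

Each vector in `progression` would then be paired with a latent sample $z \sim \mathcal{N}(0, I)$ and passed to the generator. Whether $z$ is held fixed across a progression is not stated in the caption, so that choice is left open here.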