Guided Image Generation with Conditional Invertible Neural Networks

Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, Ullrich Köthe

TL;DR

This work introduces conditional invertible neural networks (cINN) for conditional image generation, combining an invertible coupling core with a conditioning network and training via a maximum-likelihood objective to avoid mode collapse and blur. The method enables diverse, high-quality samples that faithfully respect conditioning, and supports latent-space manipulations for style transfer. Experiments on MNIST generation and diverse colorization (ImageNet and LSUN bedrooms) demonstrate meaningful control and robust performance, with thorough ablations illustrating the impact of architectural and training choices. The approach offers a scalable, likelihood-based alternative to GANs and VAEs for conditional generation, with potential applications across various vision tasks.

Abstract

In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized with a stable, maximum likelihood-based training procedure. By construction, the cINN does not experience mode collapse and generates diverse samples, in contrast to e.g. cGANs. At the same time our model produces sharp images since no reconstruction loss is required, in contrast to e.g. VAEs. We demonstrate these properties for the tasks of MNIST digit generation and image colorization. Furthermore, we take advantage of our bi-directional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.
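The abstract names two ingredients: an invertible network that also receives the conditioning input, and a maximum likelihood objective. The following is a minimal NumPy sketch of how these could fit together, not the authors' implementation. The class `ConditionalAffineCoupling`, the helper `make_subnet`, the fixed random weights standing in for learned subnetworks, and the flat vector shapes are all assumptions made for illustration. The condition `c` is simply concatenated to the subnetwork inputs, which keeps the block invertible in `u` for any fixed `c`.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_subnet(in_dim, out_dim, hidden=16):
    """Hypothetical stand-in for a learned subnetwork: a tiny fixed random MLP."""
    w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, 2 * out_dim))

    def subnet(h):
        out = np.tanh(h @ w1) @ w2
        log_s, t = out[:, :out_dim], out[:, out_dim:]
        # Bound the log-scale so that exp() stays numerically well behaved.
        return np.tanh(log_s), t

    return subnet


class ConditionalAffineCoupling:
    """Affine coupling block whose subnetworks also see the condition c."""

    def __init__(self, dim, cond_dim):
        self.d1 = dim // 2
        self.d2 = dim - self.d1
        self.net1 = make_subnet(self.d2 + cond_dim, self.d1)  # acts on u1
        self.net2 = make_subnet(self.d1 + cond_dim, self.d2)  # acts on u2

    def forward(self, u, c):
        u1, u2 = u[:, :self.d1], u[:, self.d1:]
        log_s1, t1 = self.net1(np.concatenate([u2, c], axis=1))
        v1 = u1 * np.exp(log_s1) + t1
        log_s2, t2 = self.net2(np.concatenate([v1, c], axis=1))
        v2 = u2 * np.exp(log_s2) + t2
        # The Jacobian is triangular, so its log-determinant is a plain sum.
        log_det = log_s1.sum(axis=1) + log_s2.sum(axis=1)
        return np.concatenate([v1, v2], axis=1), log_det

    def inverse(self, v, c):
        v1, v2 = v[:, :self.d1], v[:, self.d1:]
        log_s2, t2 = self.net2(np.concatenate([v1, c], axis=1))
        u2 = (v2 - t2) * np.exp(-log_s2)
        log_s1, t1 = self.net1(np.concatenate([u2, c], axis=1))
        u1 = (v1 - t1) * np.exp(-log_s1)
        return np.concatenate([u1, u2], axis=1)


def nll(z, log_det):
    """Per-sample negative log-likelihood for a standard normal latent,
    up to an additive constant: 0.5 * ||z||^2 - log|det J|."""
    return 0.5 * (z ** 2).sum(axis=1) - log_det
```

In this sketch, training would minimize the mean of `nll(z, log_det)` over pairs of data and condition, and sampling would draw `z` from a standard normal and call `inverse(z, c)`. No reconstruction loss appears anywhere, which matches the abstract's contrast with VAEs.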

Paper Structure

This paper contains 14 sections, 7 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Diverse colorizations, which our network created for the same grayscale image. One of them shows ground truth colors, but which? Solution at the bottom of the page.
  • Figure 2: One conditional affine coupling block (CC).
  • Figure 3: Haar wavelet downsampling reduces spatial dimensions & separates lower frequencies (a) from high (h,v,d).
  • Figure 4: Axes in our MNIST model's latent space, which linearly encode the style attributes width, thickness and slant.
  • Figure 5: cINN model for conditional MNIST generation.
  • ...and 11 more figures
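
The Haar wavelet downsampling named in the Figure 3 caption can be illustrated with a short sketch. This is a generic orthonormal 2x2 Haar transform written for illustration, not code from the paper. The function names `haar_downsample` and `haar_upsample`, the `(C, H, W)` array layout, and the channel ordering (average first, then the three difference bands) are assumptions, as is the mapping of the difference bands to the caption's labels h, v and d.

```python
import numpy as np


def haar_downsample(x):
    """Map (C, H, W) with even H and W to (4*C, H/2, W/2).

    Each 2x2 pixel patch becomes one average and three difference values.
    """
    a = x[:, 0::2, 0::2]  # top-left pixel of each patch
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    avg = (a + b + c + d) / 2     # low-frequency band
    diff_lr = (a - b + c - d) / 2  # left minus right
    diff_tb = (a + b - c - d) / 2  # top minus bottom
    diff_dg = (a - b - c + d) / 2  # diagonal difference
    return np.concatenate([avg, diff_lr, diff_tb, diff_dg], axis=0)


def haar_upsample(y):
    """Exact inverse of haar_downsample."""
    n = y.shape[0] // 4
    avg, diff_lr, diff_tb, diff_dg = y[:n], y[n:2 * n], y[2 * n:3 * n], y[3 * n:]
    x = np.empty((n, 2 * avg.shape[1], 2 * avg.shape[2]), dtype=y.dtype)
    # The 4x4 Haar matrix is symmetric and orthogonal, so it is its own inverse.
    x[:, 0::2, 0::2] = (avg + diff_lr + diff_tb + diff_dg) / 2
    x[:, 0::2, 1::2] = (avg - diff_lr + diff_tb - diff_dg) / 2
    x[:, 1::2, 0::2] = (avg + diff_lr - diff_tb - diff_dg) / 2
    x[:, 1::2, 1::2] = (avg - diff_lr - diff_tb + diff_dg) / 2
    return x
```

Because the transform in this sketch is orthonormal, it is exactly invertible and contributes nothing to the log-determinant of the Jacobian, which is what makes this kind of downsampling convenient inside an invertible network.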