Swapping Autoencoder for Deep Image Manipulation

Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei A. Efros, Richard Zhang

TL;DR

The Swapping Autoencoder addresses controllable image editing by learning a two-code latent space that encodes texture and structure separately. A texture-focused patch co-occurrence discriminator and a swap-based training objective encourage independent, transferable factors. This enables texture swapping, region edits, and latent space vector arithmetic, while a StyleGAN2-based generator maintains realism. The model embeds real images in real time and produces realistic image hybrids and translations, achieving favorable perceptual and reconstruction metrics compared with baselines across several datasets. An interactive UI demonstrates practical editing workflows, highlighting the method's potential for democratized, controllable image manipulation. The work also discusses limitations and broader implications for image provenance and future evaluation of disentanglement quality.

Abstract

Deep generative models have become increasingly effective at producing realistic images from randomly sampled seeds, but using such models for controllable manipulation of existing images remains challenging. We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation, rather than random sampling. The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image. In particular, we encourage the components to represent structure and texture, by enforcing one component to encode co-occurrent patch statistics across different parts of an image. As our method is trained with an encoder, finding the latent codes for a new input image becomes trivial, rather than cumbersome. As a result, it can be used to manipulate real input images in various ways, including texture swapping, local and global editing, and latent code vector arithmetic. Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
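
To make the swap operation concrete, here is a toy sketch in plain Python. It is not the authors' model: the learned encoder and generator are replaced by hand-crafted stand-ins (the hypothetical `encode`, `decode`, and `swap` below), with a normalized spatial map playing the role of the structure code and global mean and standard deviation playing the role of the texture code. It only illustrates the data flow of encoding two images and decoding a swapped combination.

```python
from statistics import mean, pstdev


def encode(image):
    """Toy stand-in for the encoder: split an image into two codes.

    `image` is a list of rows of grayscale values (an assumption made
    for this sketch). The "structure" code keeps the spatial layout;
    the "texture" code is a global summary with no spatial dimensions.
    """
    pixels = [p for row in image for p in row]
    mu = mean(pixels)
    sigma = pstdev(pixels) or 1.0  # avoid dividing by zero on flat images
    structure = [[(p - mu) / sigma for p in row] for row in image]
    texture = (mu, sigma)
    return structure, texture


def decode(structure, texture):
    """Toy stand-in for the generator: recombine the two codes."""
    mu, sigma = texture
    return [[s * sigma + mu for s in row] for row in structure]


def swap(image_a, image_b):
    """Decode the structure of image_a with the texture of image_b."""
    structure_a, _ = encode(image_a)
    _, texture_b = encode(image_b)
    return decode(structure_a, texture_b)
```

In this toy version, `decode(*encode(x))` reconstructs `x`, and `swap(a, b)` keeps the spatial layout of `a` while taking on the global intensity statistics of `b`. The actual method learns both codes and relies on discriminators, rather than fixed statistics, to decide what counts as texture.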

Paper Structure

This paper contains 25 sections, 4 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Our Swapping Autoencoder learns to disentangle texture from structure for image editing tasks. One such task is texture swapping, shown here. Please see our project https://taesungp.github.io/SwappingAutoencoder for a demo video of our editing method.
  • Figure 2: Swapping Autoencoder consists of autoencoding (top) and swapping (bottom) operations. (Top) An encoder $E$ embeds an input (Notre-Dame) into two codes. The structure code is a tensor with spatial dimensions; the texture code is a 2048-dimensional vector. Decoding with generator $G$ should produce a realistic image (enforced by discriminator $D$) matching the input (reconstruction loss). (Bottom) Decoding with the texture code from a second image (Saint Basil's Cathedral) should look realistic (via $D$) and match the texture of that second image, by training with a patch co-occurrence discriminator $D_{\text{patch}}$ that enforces that the output and reference patches look indistinguishable.
  • Figure 3: Embedding examples and reconstruction quality. We project images into embedding spaces for our method and baseline GAN models, Im2StyleGAN [abdal2019image2stylegan, karras2019style] and StyleGAN2 [karras2020analyzing]. Our reconstructions better preserve the detailed outline (e.g., doorway, eye gaze) than StyleGAN2, and appear crisper than Im2StyleGAN. This is verified on average with the LPIPS metric [zhang2018unreasonable]. Our method also reconstructs images much faster than recent generative models that use iterative optimization. See the appendix for more visual examples.
  • Figure 4: Image swapping. Each row shows the result of combining the structure code of the leftmost image with the texture code of the top image (trained on LSUN Church and Bedroom). Our model generates realistic images that preserve texture (e.g., material of the building, or the bedsheet pattern) and structure (outline of objects).
  • Figure 5: Comparison of image hybrids. Our approach generates realistic results that combine scene structure with elements of global texture, such as the shape of the towers (church), the hair color (portrait), and the long exposure (waterfall). Please see the appendix for more comparisons.
  • ...and 13 more figures
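
The training signals described in the Figure 2 caption can be written down roughly as follows. This is a sketch in notation assumed here, not quoted from the paper: $z_s^i$ and $z_t^i$ denote the structure and texture codes of image $x^i$, the $\ell_1$ reconstruction norm and the $-\log$ form of the adversarial terms are assumptions, and $\mathrm{crop}$ is a hypothetical operator that extracts a random patch.

```latex
\begin{align}
(z_s^i, z_t^i) &= E(x^i) \\
% Autoencoding (top): the decoded image should match the input ...
\mathcal{L}_{\text{rec}} &= \mathbb{E}_{x}\big[\lVert x - G(E(x)) \rVert_1\big] \\
% ... and look realistic to the discriminator D.
\mathcal{L}_{\text{GAN,rec}} &= \mathbb{E}_{x}\big[-\log D\big(G(E(x))\big)\big] \\
% Swapping (bottom): structure of x^1 with texture of x^2 should also look realistic ...
\mathcal{L}_{\text{GAN,swap}} &= \mathbb{E}_{x^1 \neq x^2}\big[-\log D\big(G(z_s^1, z_t^2)\big)\big] \\
% ... and its patches should be indistinguishable from patches of x^2.
\mathcal{L}_{\text{patch}} &= \mathbb{E}_{x^1 \neq x^2}\big[-\log D_{\text{patch}}\big(\mathrm{crop}(G(z_s^1, z_t^2)),\ \mathrm{crop}(x^2)\big)\big]
\end{align}
```

The encoder and generator would minimize a weighted sum of these terms while $D$ and $D_{\text{patch}}$ are trained adversarially; the relative weights are not given in the text above and are left unspecified here.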