Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation

Xiuding Cai, Yaoyao Zhu, Dong Miao, Linjie Fu, Yu Yao

TL;DR

EnCo redefines content constraints for unpaired image-to-image translation by enforcing latent-space similarity between same-stage encoder and decoder features inside the generator, using a projection head and a predictor with stopping gradients. It performs patch-level, multi-stage constraints and introduces a discriminative attention-guided (DAG) patch sampling strategy, all without external reconstructions or Siamese feature extractors. The full objective combines LS-GAN adversarial loss, a multi-stage content loss with $\lambda_{\text{NCE}}=2$, and a strong identity regularizer with $\lambda_{\text{IDT}}=10$, achieving state-of-the-art FID across Cityscapes, Cat→Dog, and Horse→Zebra while maintaining training efficiency. This approach offers a compact, efficient alternative to existing content-constraint methods and highlights the benefit of discriminator-informed patch sampling for improved generative quality.
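
As a rough illustration of how the three terms named above might be combined, the following Python sketch forms a weighted sum using the two quoted weights. The summation form, the averaging of the content term over stages, the LS-GAN generator term written as the mean of $(D(G(x)) - 1)^2$, and all function names are assumptions made for illustration and are not taken from the EnCo code. The identity loss is passed in as a precomputed scalar because its exact form is not stated here, and plain Python has no autodiff, so the stop-gradient is only noted in a comment.

```python
from statistics import fmean

LAMBDA_NCE = 2.0   # content-loss weight quoted in the text
LAMBDA_IDT = 10.0  # identity-loss weight quoted in the text


def patch_mse(pred, target):
    """Mean squared error between two equally sized lists of patch vectors."""
    squared = [
        (p - t) ** 2
        for pred_vec, target_vec in zip(pred, target)
        for p, t in zip(pred_vec, target_vec)
    ]
    return fmean(squared)


def lsgan_generator_loss(d_fake):
    """One common least-squares GAN generator term: push D(G(x)) towards 1."""
    return fmean([(d - 1.0) ** 2 for d in d_fake])


def generator_objective(d_fake, stage_pairs, idt_loss):
    """Assumed weighted sum of adversarial, content, and identity terms.

    stage_pairs holds one (predicted, target) pair of patch features per
    generator stage. In an autodiff framework the target side would be
    detached (stop-gradient); here both sides are plain lists of floats.
    """
    content = fmean([patch_mse(pred, target) for pred, target in stage_pairs])
    adversarial = lsgan_generator_loss(d_fake)
    return adversarial + LAMBDA_NCE * content + LAMBDA_IDT * idt_loss
```

With a single stage, `generator_objective([1.0, 1.0], [([[1.0, 2.0]], [[1.0, 0.0]])], 0.5)` has a zero adversarial term, a content term of 2.0, and an identity term of 0.5, so the weighted sum is 2 × 2.0 + 10 × 0.5 = 9.0.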

Abstract

In an unpaired setting, lacking sufficient content constraints for image-to-image translation (I2I) tasks, GAN-based approaches are usually prone to model collapse. Current solutions can be divided into two categories: reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be perfectly converted back to the original image, which is sometimes too strict and limits the generative performance. The latter involves feeding the original and generated images into a feature extractor and then matching their outputs. This is not efficient enough, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to maintain the content by constraining the representational similarity in the latent space of patch-level features from the same stage of the Encoder and deCoder of the generator. For the similarity function, we use a simple MSE loss instead of the contrastive loss that is currently widely used in I2I tasks. Benefiting from this design, EnCo training is extremely efficient, while the features from the encoder have a more positive effect on decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and requires only negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, and we achieve multiple state-of-the-art results compared to previous methods. Our code is available at https://github.com/XiudingCai/EnCo-pytorch.
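
To make the contrast between random and attention-guided patch sampling concrete, here is a minimal Python sketch. It assumes that a 2D attention map has already been derived from the discriminator (how that map is computed is not described here) and that "guided" sampling means keeping the highest-scoring positions. Both of these, and the function names `dag_sample` and `random_sample`, are hypothetical choices for illustration; the selection rule used in EnCo may differ.

```python
import random


def dag_sample(attention, num_patches):
    """Return the (row, col) positions with the highest attention scores.

    attention is a list of rows of floats, assumed to come from the
    discriminator. Top-k selection is an assumed reading of "guided".
    """
    scored = [
        (score, r, c)
        for r, row in enumerate(attention)
        for c, score in enumerate(row)
    ]
    # Stable sort: positions with equal scores keep row-major order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:num_patches]]


def random_sample(height, width, num_patches, seed=0):
    """Baseline: draw distinct positions uniformly at random."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(cells, num_patches)
```

For the map `[[0.1, 0.9], [0.5, 0.2]]`, `dag_sample(..., 2)` keeps the two highest-scoring positions, `(0, 1)` and `(1, 0)`, whereas `random_sample` ignores the scores entirely. Neither function introduces learnable parameters, which matches the parameter-free description above.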

Paper Structure (16 sections, 7 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: A comparison of different content-constraint frameworks. (a) Reconstruction-based methods require that $x\leftrightarrow G_2(G_1(x))$, typically enforced with an $\ell_1$ or $\ell_2$ loss. Typical methods are CycleGAN CycleGAN2017, UNIT CycleGAN2017, etc. (b) Siamese network-based methods such as CUT park2020cut or LSeSim zheng2021spatially impose the content constraint through a defined feature extractor $S$, i.e., $\text{match}(S(G(x)), S(x'))$ or $\text{match}(S(G(x)), S(G(x')))$, where $x'$ is the augmented $x$. Note that the augmentation of $x$ is optional (dashed box). (c) EnCo imposes the content constraint by enforcing representational similarity between features from the encoder and decoder of the generator.
  • Figure 2: (a) Overview of the EnCo framework. EnCo constrains the content by enforcing representational similarity in the latent space between features from the same stage of the encoder and decoder of the generator. (b) The architecture of the projection. (c) The architecture of the prediction.
  • Figure 3: Results of qualitative comparison. We compare EnCo with existing methods on the Horse$\rightarrow$Zebra, Cat$\rightarrow$Dog, and Cityscapes datasets. EnCo achieves more satisfactory visual results. For example, in the case of Cat$\rightarrow$Dog, EnCo generates a clearer nose for the dog. In the case of Cityscapes, EnCo successfully generates the traffic cone represented in yellow in the semantic annotation, while the other methods yield only suboptimal results.
  • Figure 4: Comparison of different patch sampling strategies. For the input image $x$, we superimpose the sampling positions every 10 epochs to obtain the sampling frequency map (b). As can be seen from (c), compared to the random strategy, our proposed DAG patch sampling strategy focuses more on regions that help in domain discrimination, such as the ears, eyes, and nose. As a result, the model with the DAG sampling strategy generates more adorable results than the model with the random sampling strategy (see the red box in (d)).
  • Figure 5: Comparison of test FIDs over training time in tasks Cat$\rightarrow$Dog (left) and Cityscapes (right).
  • ...and 4 more figures