Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation
Xiuding Cai, Yaoyao Zhu, Dong Miao, Linjie Fu, Yu Yao
TL;DR
EnCo redefines content constraints for unpaired image-to-image translation by enforcing latent-space similarity between same-stage encoder and decoder features inside the generator, using a projection head and a predictor with a stop-gradient operation. It applies patch-level, multi-stage constraints and introduces a discriminative attention-guided (DAG) patch sampling strategy, all without external reconstructions or Siamese feature extractors. The full objective combines an LS-GAN adversarial loss, a multi-stage content loss with $\lambda_{NCE}=2$, and a strong identity regularizer with $\lambda_{IDT}=10$, achieving state-of-the-art FID on Cityscapes, Cat→Dog, and Horse→Zebra while maintaining training efficiency. This approach offers a compact, efficient alternative to existing content-constraint methods and highlights the benefit of discriminator-informed patch sampling for improved generative quality.
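As a reading aid, the objective summarized above can be sketched in symbols. This is only an illustration under stated assumptions: the additive form, the unit weight on the adversarial term, and all symbol names other than $\lambda_{NCE}$ and $\lambda_{IDT}$ are assumptions, not notation confirmed by this summary.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GAN}} + \lambda_{NCE}\,\mathcal{L}_{\text{content}} + \lambda_{IDT}\,\mathcal{L}_{\text{IDT}}, \qquad \lambda_{NCE}=2,\quad \lambda_{IDT}=10
$$

$$
\mathcal{L}_{\text{content}} = \sum_{i=1}^{N} \frac{1}{|S_i|} \sum_{s \in S_i} \left\| q\big(g(h_i^{s})\big) - \operatorname{sg}\big(g(\hat{h}_i^{s})\big) \right\|_2^2
$$

Here $h_i^{s}$ is assumed to denote the encoder feature of patch $s$ at stage $i$, $\hat{h}_i^{s}$ the decoder feature at the same stage, $g$ the projection head, $q$ the predictor, $\operatorname{sg}$ the stop-gradient operation, and $S_i$ the set of sampled patches at stage $i$. Which branch passes through the predictor and which is detached by the stop-gradient is an assumption in this sketch.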
Abstract
In the unpaired setting, GAN-based approaches to image-to-image translation (I2I) lack sufficient content constraints and are therefore prone to mode collapse. Current solutions fall into two categories: reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be converted back perfectly to the original image, which is sometimes too strict and limits generative performance. The latter feeds the original and generated images into a feature extractor and matches their outputs, which is inefficient, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to preserve content by constraining the representational similarity, in the latent space, of patch-level features from the same stage of the \textbf{En}coder and de\textbf{Co}der of the generator. For the similarity function, we use a simple MSE loss instead of the contrastive loss that is currently widely used in I2I tasks. Thanks to this design, EnCo training is extremely efficient, and the encoder features have a more positive effect on decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and adds only negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, which achieves state-of-the-art results on several benchmarks compared to previous methods. Our code is available at https://github.com/XiudingCai/EnCo-pytorch.
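To make the contrast between random and attention-guided patch sampling concrete, here is a minimal sketch in plain Python. It assumes a per-location attention map has already been derived from the discriminator (how that map is computed is not described in this abstract) and that guidance means keeping the highest-scoring locations. The function names `dag_sample_patches` and `random_sample_patches` are hypothetical and are not taken from the EnCo code base; the selection rule is one plausible reading, not necessarily the authors' rule.

```python
import random


def random_sample_patches(height, width, num_patches, seed=0):
    """Baseline: choose patch locations uniformly at random."""
    rng = random.Random(seed)
    locations = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(locations, min(num_patches, len(locations)))


def dag_sample_patches(attention, num_patches):
    """Assumed DAG-style rule: keep the locations with the highest
    discriminator attention scores.

    attention: 2D list (H x W) of scores, one per patch location.
    Returns a list of (row, col) tuples, highest score first.
    """
    height, width = len(attention), len(attention[0])
    scored = [
        (attention[r][c], r, c)
        for r in range(height)
        for c in range(width)
    ]
    # Sort by descending score; break ties by position so the result
    # is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    k = min(num_patches, len(scored))
    return [(r, c) for _, r, c in scored[:k]]


# Toy 3x3 attention map (made-up values for illustration only).
toy_attention = [
    [0.10, 0.90, 0.20],
    [0.80, 0.05, 0.30],
    [0.40, 0.70, 0.60],
]
selected = dag_sample_patches(toy_attention, 3)
```

With the toy map above, `selected` holds the three locations with the largest scores, whereas `random_sample_patches(3, 3, 3)` ignores the scores entirely. Like the abstract's description of DAG, this rule adds no learnable parameters: it only reorders locations that random sampling would otherwise draw blindly.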
