One-for-All: Towards Universal Domain Translation with a Single StyleGAN

Yong Du, Jiahui Zhan, Xinzhe Li, Junyu Dong, Sheng Chen, Ming-Hsuan Yang, Shengfeng He

TL;DR

This work tackles universal domain translation across visually distinct domains with limited data. It introduces UniTranslator, a hybrid framework that uses CLIP as a domain-neutral bridge and a CLIP2P mapper to align CLIP embeddings with StyleGAN's latent space, enabling high-quality translations between far-apart domains. The key innovations are a decoupling module that extracts domain-agnostic semantics and a nonlinear CLIP2P mapper that bridges CLIP to StyleGAN's $P$ space, guided by a suite of losses that preserve cross-domain correspondences and visual fidelity. Extensive experiments show that UniTranslator outperforms state-of-the-art learning-based and diffusion-based methods in image quality, domain relevance, and diversity, while remaining robust to degradation and suitable for applications such as style mixing and stylization. The approach offers a practical path toward universal translation from any source domain to a specified target domain, and the code and models are planned for public release.

Abstract

In this paper, we propose UniTranslator, a novel translation model for transforming representations between visually distinct domains when training data are limited and the visual gap is large. The main idea is to leverage the domain-neutral capabilities of CLIP as a bridging mechanism, while using a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target domains. Fusing these abstract semantics with target-specific semantics yields a transformed embedding within the CLIP space. To bridge the gap between CLIP and StyleGAN, we introduce a new non-linear mapper, the CLIP2P mapper. Taking CLIP embeddings as input, this module is tailored to approximate the latent distribution in StyleGAN's latent space, effectively acting as a connector between the two spaces. The proposed UniTranslator is versatile and can perform various tasks, including style mixing, stylization, and translation, even in visually challenging scenarios across different visual domains. Notably, UniTranslator generates high-quality translations that exhibit domain relevance, diversity, and improved image quality. UniTranslator surpasses existing general-purpose models and performs well against specialized models in representative tasks. The source code and trained models will be released to the public.
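
The data flow described in the abstract (decouple, fuse, then map into StyleGAN's latent space) can be illustrated with a toy sketch. This is not the authors' implementation. The mean-subtraction decoupling, the additive fusion, the `ToyMapper` class, the untrained random weights, and the small dimensions are all assumptions chosen only to make the three stages concrete; in the paper the decoupling module and the CLIP2P mapper are learned components.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def decouple(embedding, domain_mean):
    """ASSUMPTION: stand-in for the learned decoupling module.
    Subtracting the domain mean leaves a residual that this sketch
    treats as the domain-agnostic semantics."""
    return [e - m for e, m in zip(embedding, domain_mean)]


def fuse(agnostic, target_specific):
    """ASSUMPTION: fuse by addition, then renormalize to unit length."""
    return normalize([a + t for a, t in zip(agnostic, target_specific)])


class ToyMapper:
    """ASSUMPTION: a tiny two-layer MLP standing in for the CLIP2P mapper.
    Weights are random here; a real mapper would be trained."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]

    def __call__(self, v):
        # Leaky ReLU supplies the nonlinearity between the two linear layers.
        hidden = []
        for row in self.w1:
            a = sum(w * x for w, x in zip(row, v))
            hidden.append(a if a > 0 else 0.2 * a)
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


# Toy data: random unit vectors stand in for CLIP image embeddings.
# The dimensions are illustrative, not those of CLIP or StyleGAN.
rng = random.Random(42)
CLIP_DIM, P_DIM = 8, 16


def random_embedding():
    return normalize([rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)])


source_domain = [random_embedding() for _ in range(5)]
target_domain = [random_embedding() for _ in range(5)]

source_mean = mean_vector(source_domain)
target_mean = mean_vector(target_domain)

# Stage 1: strip source-specific content; Stage 2: add target-specific content.
agnostic = decouple(source_domain[0], source_mean)
fused = fuse(agnostic, target_mean)

# Stage 3: map the fused embedding into a latent code of a different dimension.
mapper = ToyMapper(CLIP_DIM, 32, P_DIM)
p_code = mapper(fused)
print(len(fused), len(p_code))  # prints "8 16"
```

The fused vector stays in the embedding space (same dimension, unit norm), and only the mapper changes the representation. That mirrors the separation of roles stated in the abstract: the translation happens within the CLIP space, while the mapper connects the two spaces.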

Paper Structure (28 sections, 16 equations, 25 figures, 5 tables, 1 algorithm)

Figures (25)

  • Figure 1: We introduce UniTranslator, an innovative universal framework for translating across diverse visual domains. It can receive input from any real-world source domain and convert it into a specified target domain, all while ensuring high image quality, domain correspondence, and variability.
  • Figure 2: From the first row to the last, the rows illustrate visual domain translations from MetFaces to FFHQ (adjacent domains), AFHQ-dog to FFHQ (distant domains), and AFHQ-cat to LSUN-church (extremely distant domains). While GP-UNIT (b) can convert source domain images (a), it suffers from inadequate cross-domain correspondences and compromised image quality. Few-shot (c) or diffusion-based (d) domain adaptation methods are sensitive to the magnitude of the domain gap. Even for adjacent domains, these methods only shift the input image slightly towards the target domain. PULSE (e), lacking a decoupling strategy, leaves remnants of source domain patterns when confronted with significant domain gaps. In contrast, UniTranslator (f) consistently achieves high-quality image translations while upholding domain correspondence despite substantial visual disparities between the domains.
  • Figure 3: Overview of UniTranslator. It leverages the decoupling module to extract domain-agnostic semantics and integrates them with target-specific information, resulting in a refined CLIP embedding with robust cross-domain correlations. This enhanced CLIP embedding will more effectively guide the search for an optimal $z$ code. Moreover, the CLIP2P mapper is engineered to map the CLIP embedding into $P$ space, reducing the likelihood of it falling outside of StyleGAN's latent space. A demo video is included in the supplementary material.
  • Figure 4: Illustration of the proposed decoupling module.
  • Figure 5: Qualitative comparison of our UniTranslator with state-of-the-art methods for translating $\mathcal{X}$ to FFHQ. Note the significant issues encountered by other methods, including severe distortions (VQ-I2I), poor cross-domain correspondences (GP-UNIT and StarGAN2), residual source domain patterns (VQ-I2I and PULSE), and even the inability to generate target domain patterns (Difa, DiffusionCLIP, and DiffuseIT).
  • ...and 20 more figures