HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Abdul Basit Anees, Ahmet Canberk Baykal, Muhammed Burak Kizil, Duygu Ceylan, Erkut Erdem, Aykut Erdem

TL;DR

This work presents a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks and introduces a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality.

Abstract

Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. To this end, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration allows dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, improving image quality. Our approach is flexible: it enables text-guided image manipulation without the need for text-specific training data and facilitates seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.
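
To make the idea of "integrating CLIP space via hypernetworks" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical one-layer hypernetwork (`hyper_deltas`) that maps a CLIP embedding to one bounded offset per output channel, and a hypothetical `modulate` step that rescales frozen generator weights by `1 + delta`. The layer sizes, the `tanh` bound, and the multiplicative form are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random

def hyper_deltas(embedding, hyper_weights):
    """Hypothetical hypernetwork: a single linear layer that maps a CLIP
    embedding to one offset per output channel, bounded to (-1, 1) by tanh."""
    return [math.tanh(sum(h * e for h, e in zip(row, embedding)))
            for row in hyper_weights]

def modulate(base_weights, deltas):
    """Assumed modulation rule: scale each output channel of the frozen
    base weights by (1 + delta). A zero delta leaves the weights unchanged."""
    return [[w * (1.0 + d) for w in row] for row, d in zip(base_weights, deltas)]

random.seed(0)
embed_dim, out_ch, in_ch = 8, 4, 3  # toy sizes, chosen only for illustration

# Stand-ins for a frozen generator layer, hypernetwork parameters,
# and a CLIP embedding of a reference image or text prompt.
base = [[random.gauss(0, 1) for _ in range(in_ch)] for _ in range(out_ch)]
hyper = [[random.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(out_ch)]
clip_embedding = [random.gauss(0, 1) for _ in range(embed_dim)]

adapted = modulate(base, hyper_deltas(clip_embedding, hyper))
identity = modulate(base, hyper_deltas([0.0] * embed_dim, hyper))
```

In this sketch the pre-trained weights are never overwritten: a different embedding yields a different set of adapted weights from the same base layer, which is one way to read "dynamic adaptation" to domains given by images or text.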

Paper Structure

This paper contains 21 sections, 12 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of HyperGAN-CLIP. This framework employs hypernetwork modules to adjust StyleGAN generator weights based on images or text prompts. These inputs facilitate domain adaptation, attribute transfer, or image editing. The modulated weights blend with original features to produce images that align with specified domains or tasks like reference-guided synthesis and text-guided manipulation, while maintaining source integrity.
  • Figure 2: Comparison against state-of-the-art few-shot domain adaptation methods. Our proposed HyperGAN-CLIP model outperforms competing methods in accurately capturing the visual characteristics of the target domains.
  • Figure 3: Capabilities of HyperGAN-CLIP in blending domains and performing semantic edits within adapted domains.
  • Figure 4: Comparison with state-of-the-art reference-guided image synthesis approaches. Our approach transfers the style of the target image to the source image while preserving identity better than competing methods.
  • Figure 5: Reference-guided image synthesis with mixed embeddings. Each row shows the input image, the initial result with the CLIP image embedding, the refined result with a mixed embedding that incorporates the target attribute with $\alpha=0.5$, and the reference image, respectively. Target text attributes are "beard" (top row), "black hair" (middle row), and "smiling" (bottom row). Incorporating mixed modality embeddings results in more accurate and detailed image modifications.
  • ...and 1 more figure
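
The mixed embedding mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formula: it assumes the mix is a linear blend of unit-normalized CLIP image and text embeddings, that $\alpha$ weights the text side, and that the result is re-normalized. The function names `normalize` and `mix_embeddings` and the three-dimensional toy vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mix_embeddings(image_emb, text_emb, alpha=0.5):
    """Assumed blend: (1 - alpha) * image + alpha * text on unit vectors,
    followed by re-normalization. alpha = 0 keeps only the image embedding."""
    img, txt = normalize(image_emb), normalize(text_emb)
    mixed = [(1.0 - alpha) * i + alpha * t for i, t in zip(img, txt)]
    return normalize(mixed)

# Toy stand-ins for a reference-image embedding and a text embedding
# of an attribute such as "beard".
image_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]

mixed = mix_embeddings(image_emb, text_emb, alpha=0.5)
image_only = mix_embeddings(image_emb, text_emb, alpha=0.0)
```

With $\alpha=0.5$ the two modalities contribute equally in this sketch, so the mixed vector sits halfway between the reference-image direction and the attribute direction on the unit sphere.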