UniHDA: A Unified and Versatile Framework for Multi-Modal Hybrid Domain Adaptation

Hengjia Li; Yang Liu; Yuqi Lin; Zhanwei Zhang; Yibo Zhao; Weihang Pan; Tu Zheng; Zheng Yang; Yuchun Jiang; Boxi Wu; Deng Cai

TL;DR

UniHDA tackles multifaceted generative domain adaptation by enabling a pre-trained generator to synthesize hybrid domains that combine attributes from multiple text and image references. It maps all references into a unified CLIP embedding space and forms the hybrid-domain direction through linear interpolation of target-domain directions, guided by a multi-modal direction loss. A cross-domain spatial structure loss based on Dino-ViT preserves fine-grained spatial information to maintain consistency with the source domain. The framework is generator-agnostic, validated on 2D and 3D generators as well as diffusion models, and demonstrates strong cross-domain consistency and attribute inheritance across image-image, text-text, and image-text tasks with substantial efficiency gains.

Abstract

Recently, generative domain adaptation has achieved remarkable progress, enabling us to adapt a pre-trained generator to a new target domain. However, existing methods adapt the generator to only a single target domain and are limited to a single modality, either text-driven or image-driven. Moreover, they fail to maintain consistency with the source domain, which impedes the inheritance of its diversity. In this paper, we propose UniHDA, a unified and versatile framework for generative hybrid domain adaptation with multi-modal references from multiple domains. We use the CLIP encoder to project multi-modal references into a unified embedding space and then linearly interpolate the direction vectors from multiple target domains to achieve hybrid domain adaptation. To ensure consistency with the source domain, we propose a novel cross-domain spatial structure (CSS) loss that maintains detailed spatial structure information between the source and target generators. Experiments show that the adapted generator can synthesise realistic images with various attribute compositions. Additionally, our framework is generator-agnostic and applies to multiple generators, e.g., StyleGAN, EG3D, and Diffusion Models.
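
The interpolation step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: random vectors stand in for CLIP embeddings, the 512-dimensional size is an arbitrary choice, the helper names (`unit`, `hybrid_direction`, `direction_loss`) are hypothetical, and both the normalisation of each direction and the cosine-distance form of the loss are assumptions made for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def hybrid_direction(source_emb, target_embs, weights):
    """Linearly combine per-domain directions (target minus source).

    Assumption: each direction is normalised before mixing; the
    abstract does not say whether this is done.
    """
    weights = np.asarray(weights, dtype=float)
    directions = np.stack([unit(t - source_emb) for t in target_embs])
    return unit(weights @ directions)

def direction_loss(sample_shift, hybrid_dir):
    """Assumed loss form: 1 - cosine similarity between a sample's
    embedding shift and the hybrid-domain direction."""
    return 1.0 - float(unit(sample_shift) @ unit(hybrid_dir))

# Random stand-ins for embeddings of one source and two target references.
rng = np.random.default_rng(0)
dim = 512  # hypothetical embedding size
source = rng.normal(size=dim)
targets = [rng.normal(size=dim), rng.normal(size=dim)]

d = hybrid_direction(source, targets, [0.5, 0.5])
only_first = hybrid_direction(source, targets, [1.0, 0.0])
loss_aligned = direction_loss(d, d)    # shift already follows the hybrid direction
loss_opposed = direction_loss(-d, d)   # shift points the opposite way
```

Because text and image references land in the same embedding space, the same function covers image-image, text-text, and image-text combinations; only the source of the entries in `targets` changes.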

Paper Structure (18 sections, 6 equations, 11 figures, 6 tables)

This paper contains 18 sections, 6 equations, 11 figures, 6 tables.

Introduction
Related Work
Method
Multi-Modal Hybrid Domain Adaptation
Multi-modal Direction Loss
Linear Composition of Direction Vectors
Cross-domain Spatial Structure Loss
Experiments
Experimental Setting
Image-image Hybrid Domain Adaptation
Text-text Hybrid Domain Adaptation
Image-text Hybrid Domain Adaptation
Comparison with Existing Methods
Generalization on 3D Generator
Generalization on Diffusion Model
...and 3 more sections

Figures (11)

Figure 1: Given a pre-trained source generator and multiple target domains, UniHDA adapts the generator to a hybrid target domain that blends all characteristics at once and maintains robust cross-domain consistency. UniHDA supports both image and text modalities and is versatile to multiple generators.
Figure 2: Existing methods like NADA [gal2021stylegan] fail to maintain consistency with the source domain for hybrid domain adaptation, resulting in overfitting to the limited references and impeding the inheritance of the diversity in the source domain.
Figure 3: Linear interpolation between multi-modal direction vectors. We represent the domain shift by the direction vector from the source embedding to the target embedding (e.g., Crying or Happy). Linearly interpolating these vectors during training yields a smooth traversal between the domains. The coefficients for the right domain are 0, 0.2, 0.4, 0.6, 0.8, and 1, respectively, while those for the left domain are set inversely.
Figure 4: Overview of UniHDA with multi-modal direction loss $\mathcal{L}_{direct}$ and cross-domain spatial structure loss $\mathcal{L}_{\text{CSS}}$. Using the CLIP image and text encoders, $\mathcal{L}_{direct}$ encourages $G_{\mathcal{T}}$ to faithfully acquire domain-specific characteristics from multi-modal references. To help $G_{\mathcal{T}}$ inherit diversity from $G_{\mathcal{S}}$, $\mathcal{L}_{\text{CSS}}$ improves cross-domain consistency by maintaining detailed spatial structure information. The red solid line represents positive pairs, while red dashed lines represent negative pairs.
Figure 5: Image-image hybrid domain adaptation. We compare the results of FHDA [li2023fhda], NADA [gal2021stylegan], and UniHDA (Ours) with the same noise. FHDA and NADA generate images with poor cross-domain consistency, leading to limited diversity. In contrast, UniHDA alleviates overfitting and maintains strong cross-domain consistency.
...and 6 more figures
