GCA-3D: Towards Generalized and Consistent Domain Adaptation of 3D Generators

Hengjia Li, Yang Liu, Yibo Zhao, Haoran Cheng, Yang Yang, Linxuan Xia, Zekai Luo, Qibo Qiu, Boxi Wu, Tu Zheng, Zheng Yang, Deng Cai

TL;DR

GCA-3D addresses two core challenges of 3D generative domain adaptation: it removes the need for a data-generation pipeline, and it supports both text-guided and one-shot image-guided adaptation. It introduces a depth-aware SDS loss (DSDS) that conditions on depth maps from the source generator and on foreground masks to reduce overfitting, with multi-modal prompts supplied via CLIP and IP-Adapter. It also introduces a hierarchical spatial consistency (HSC) loss that aligns cross-domain spatial structure with a coarse-to-fine, patch-based contrastive objective. The combined objective, $\mathcal{L}_{\text{overall}} = \mathcal{L}_{\text{DSDS}} + \lambda \mathcal{L}_{\text{HSC}}$, improves pose accuracy, identity consistency, and diversity across text- and image-guided targets, and is more efficient than prior pipeline methods. The approach is relevant to flexible, high-quality 3D domain adaptation in graphics and vision, although limitations of the underlying diffusion model remain a constraint on further improvement.
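For orientation, the standard score distillation sampling (SDS) gradient that a depth-aware variant would build on can be written as follows. The first line is the usual SDS form for a rendered image $x$ with generator parameters $\theta$, prompt embedding $y$, timestep $t$, and noise $\epsilon$. The second line is only an assumed notation for adding a depth condition $d$ to the noise predictor; it is a sketch of the idea described above, not the paper's exact DSDS formula, and it omits the foreground mask.

$$
\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

$$
\nabla_\theta \mathcal{L}_{\text{DSDS}} \approx \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, d,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

Here $w(t)$ is a timestep-dependent weight and $x_t$ is the noised rendering. In the multi-modal setting, $y$ would come from a text prompt or from an image prompt.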

Abstract

Recently, 3D generative domain adaptation has emerged as a way to adapt a pre-trained generator to other domains without collecting massive datasets and camera pose distributions. Existing methods typically leverage large-scale pre-trained text-to-image diffusion models to synthesize images for the target domain and then fine-tune the 3D model. However, they suffer from a tedious data-generation pipeline, which inevitably introduces pose bias between the source domain and the synthetic dataset. Furthermore, they do not generalize to one-shot image-guided domain adaptation, which is more challenging because the single image reference introduces more severe pose bias as well as additional identity bias. To address these issues, we propose GCA-3D, a generalized and consistent 3D domain adaptation method that avoids the intricate data-generation pipeline. Unlike previous pipeline methods, we introduce a multi-modal depth-aware score distillation sampling loss to efficiently adapt 3D generative models in a non-adversarial manner. This multi-modal loss enables GCA-3D to adapt with either a text prompt or a one-shot image prompt. In addition, it leverages per-instance depth maps from the volume rendering module to mitigate overfitting and retain the diversity of results. To enhance pose and identity consistency, we further propose a hierarchical spatial consistency loss that aligns the spatial structure of the images generated in the source and target domains. Experiments demonstrate that GCA-3D outperforms previous methods in terms of efficiency, generalization, pose accuracy, and identity consistency.
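The hierarchical spatial consistency idea can be illustrated with a small NumPy sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`patch_features`, `patch_info_nce`, `hierarchical_consistency`, `overall_loss`), the use of average pooling to form patches, the InfoNCE form of the contrastive term, the temperature `tau`, and the patch sizes are all hypothetical choices made for illustration. The inputs stand in for feature maps of source and target images rendered from the same noise $z$.

```python
import numpy as np


def patch_features(feat, patch):
    """Average-pool an (H, W, C) feature map into non-overlapping patches, giving (N, C)."""
    h, w, c = feat.shape
    gh, gw = h // patch, w // patch
    feat = feat[: gh * patch, : gw * patch]
    pooled = feat.reshape(gh, patch, gw, patch, c).mean(axis=(1, 3))
    return pooled.reshape(gh * gw, c)


def patch_info_nce(src, tgt, tau=0.07):
    """InfoNCE over patches: the source patch at the same location is the positive,
    all other source patches are negatives (an assumed contrastive form)."""
    src = src / (np.linalg.norm(src, axis=1, keepdims=True) + 1e-8)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    logits = tgt @ src.T / tau                      # (N, N) cosine similarities / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def hierarchical_consistency(src_feat, tgt_feat, patch_sizes=(8, 4, 2)):
    """Coarse-to-fine: average the patch-level loss over decreasing patch sizes."""
    losses = [
        patch_info_nce(patch_features(src_feat, p), patch_features(tgt_feat, p))
        for p in patch_sizes
    ]
    return float(np.mean(losses))


def overall_loss(dsds, hsc, lam=1.0):
    """L_overall = L_DSDS + lambda * L_HSC, with the DSDS value supplied by the caller."""
    return dsds + lam * hsc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(16, 16, 8))
    misaligned = np.roll(source, 8, axis=0)  # same content, shifted spatial layout
    print("aligned   :", hierarchical_consistency(source, source))
    print("misaligned:", hierarchical_consistency(source, misaligned))
```

In this toy setup, a target whose spatial layout matches the source scores lower than one whose layout has been shifted, which is the behavior a spatial consistency term is meant to reward. In an actual training loop the DSDS term would come from a diffusion model rather than being passed in as a number.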

Paper Structure

This paper contains 21 sections, 4 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: GCA-3D generalizes to both text-guided and image-guided 3D generative domain adaptation. It successfully generates diverse results in the target domain while consistently preserving the accurate pose and diverse identities of the source domain, whereas the baseline methods fail. * denotes the 3D extension of a 2D adaptation method.
  • Figure 2: Bias issues in the synthetic dataset of pipeline methods like DATID-3D [kim2022datid], conditioned on a single text prompt or a one-shot image reference using the translation pipeline of Stable Diffusion [rombach2022high] with IP-Adapter [ye2023ip]. (1) During data generation, pose bias introduced by the condition can affect pose consistency, particularly when the condition is a single image, which leads to mostly forward-facing images. (2) Attributes of the image reference, such as the closed eyes in the joker example, can also introduce identity bias and thereby reduce the diversity of the results.
  • Figure 3: Overview of GCA-3D. Given the source 3D generator $G_\mathcal{S}$ and our target generator $G_\mathcal{T}$ (initialized from $G_\mathcal{S}$), we propose a depth-aware SDS (DSDS) loss to enable multi-modal 3D domain adaptation via a pre-trained CLIP encoder and IP-Adapter. To eliminate the pose and identity biases of the target domain, we propose a Hierarchical Spatial Consistency (HSC) loss that aligns the synthesized images of $G_\mathcal{S}$ and $G_\mathcal{T}$, given the same noise $z$, in a coarse-to-fine manner.
  • Figure 4: Qualitative comparison with existing image-guided domain adaptation methods. Our GCA-3D significantly surpasses baseline methods in diversity, pose accuracy and identity consistency.
  • Figure 5: Qualitative comparison with an image-guided adversarial adaptation method. We extend DATID-3D with IP-Adapter to enable image-guided adaptation; the extended baseline suffers from poor pose accuracy and diversity. In contrast, our method generates diverse samples with excellent pose accuracy and identity consistency. Here we use the same image reference as in \ref{fig:img}.
  • ...and 6 more figures