GCA-3D: Towards Generalized and Consistent Domain Adaptation of 3D Generators
Hengjia Li, Yang Liu, Yibo Zhao, Haoran Cheng, Yang Yang, Linxuan Xia, Zekai Luo, Qibo Qiu, Boxi Wu, Tu Zheng, Zheng Yang, Deng Cai
TL;DR
GCA-3D addresses two core problems in 3D generative domain adaptation: it removes the need for a data-generation pipeline, and it supports both text-guided and one-shot image-guided adaptation. It introduces a depth-aware SDS loss (DSDS) that conditions on foreground masks and on depth maps from the source generator to reduce overfitting, and that accepts multi-modal prompts via CLIP and IP-Adapter. It also introduces a hierarchical spatial consistency (HSC) loss that aligns spatial structure across domains with a coarse-to-fine, patch-based contrastive objective. The combined objective, $\mathcal{L}_{\text{overall}} = \mathcal{L}_{\text{DSDS}} + \lambda \mathcal{L}_{\text{HSC}}$, improves pose accuracy, identity consistency, and diversity across a range of text- and image-guided targets, and is more efficient than prior pipeline methods. The approach targets flexible, high-quality 3D domain adaptation for graphics and vision tasks, though it remains bounded by the limitations of the underlying diffusion model.
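To make the structure of the combined objective concrete, here is a minimal NumPy sketch of a patch-based contrastive term averaged over several resolution levels and then added to a scalar DSDS value with weight $\lambda$. It is an illustration under stated assumptions, not the authors' implementation: the function names (`patch_infonce`, `hierarchical_consistency`, `overall_loss`), the InfoNCE form, the temperature, and the equal weighting of levels are all hypothetical choices, and the patch features are assumed to come from some encoder applied to source-domain and target-domain renderings.

```python
import numpy as np

def patch_infonce(src_feats, tgt_feats, temperature=0.07):
    """Contrastive loss over patches (assumed InfoNCE form).

    src_feats, tgt_feats: arrays of shape (num_patches, dim).
    Patch i of the target is the positive for patch i of the source;
    all other source patches act as negatives.
    """
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    logits = tgt @ src.T / temperature               # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))        # positives lie on the diagonal

def hierarchical_consistency(src_levels, tgt_levels, temperature=0.07):
    """Average the patch loss over feature levels (coarse to fine).

    Equal weighting of levels is an assumption made for this sketch.
    """
    losses = [patch_infonce(s, t, temperature)
              for s, t in zip(src_levels, tgt_levels)]
    return float(np.mean(losses))

def overall_loss(dsds_value, hsc_value, lam=1.0):
    """L_overall = L_DSDS + lambda * L_HSC, with both terms given as scalars."""
    return dsds_value + lam * hsc_value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical levels: 4 coarse patches and 16 fine patches, 32-dim features.
    src_levels = [rng.normal(size=(4, 32)), rng.normal(size=(16, 32))]
    tgt_levels = [s + 0.1 * rng.normal(size=s.shape) for s in src_levels]
    hsc = hierarchical_consistency(src_levels, tgt_levels)
    print(overall_loss(dsds_value=1.0, hsc_value=hsc, lam=0.5))
```

In this sketch the loss is small when each target patch is most similar to the source patch at the same location, which is the sense in which a patch-based contrastive term encourages spatial alignment across domains.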
Abstract
Recently, 3D generative domain adaptation has emerged as a way to adapt a pre-trained generator to other domains without collecting massive datasets and camera pose distributions. Existing methods typically leverage large-scale pre-trained text-to-image diffusion models to synthesize images for the target domain and then fine-tune the 3D model. However, they suffer from a tedious data-generation pipeline, which inevitably introduces pose bias between the source domain and the synthetic dataset. Furthermore, they do not generalize to one-shot image-guided domain adaptation, which is more challenging because the single reference image introduces more severe pose bias as well as additional identity bias. To address these issues, we propose GCA-3D, a generalized and consistent 3D domain adaptation method that requires no intricate data-generation pipeline. Unlike previous pipeline methods, we introduce a multi-modal depth-aware score distillation sampling loss to efficiently adapt 3D generative models in a non-adversarial manner. This multi-modal loss enables GCA-3D to perform both text-prompt and one-shot image-prompt adaptation. In addition, it leverages per-instance depth maps from the volume rendering module to mitigate overfitting and retain the diversity of results. To enhance pose and identity consistency, we further propose a hierarchical spatial consistency loss that aligns the spatial structure of generated images in the source and target domains. Experiments demonstrate that GCA-3D outperforms previous methods in terms of efficiency, generalization, pose accuracy, and identity consistency.
