Condition Matters in Full-head 3D GANs

Heyuan Li, Huimin Zhang, Yuda Qiu, Zhengwentai Sun, Keru Zheng, Lingteng Qiu, Peihao Li, Qi Zuo, Ce Chen, Yujian Zheng, Yuming Gu, Zilong Dong, Xiaoguang Han

TL;DR

This work tackles the directional bias and training instability that arise when full-head 3D GANs are conditioned on view angles. It introduces BalanceHead, a semantic-conditional 3D-aware GAN trained on BalanceHead360, which uses a view-invariant front-view CLIP feature as a shared conditioning signal and a ViCiCo loss to enforce content–condition consistency. The approach yields high-fidelity, diverse, and globally coherent 360° full-head generations and robust single-view inversions, outperforming view-conditioned baselines in both qualitative and quantitative comparisons. By leveraging large-scale 2D priors and balancing supervision across views, the method demonstrates that imperfect multi-view data can supervise 3D consistency, with broad implications for 3D head synthesis and downstream tasks such as 3D hair modeling and avatar creation.

Abstract

Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making training impractical. However, previous full-head 3D GANs conventionally use the view angle as the conditioning input, which biases the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The CLIP image feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.
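
The shared-condition idea in the abstract can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `ViewSample`, `toy_encoder`, and `assign_shared_conditions` are hypothetical names, treating a yaw of 0 degrees as the frontal view is an assumed convention, and `toy_encoder` is only a stand-in for a CLIP image encoder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


@dataclass
class ViewSample:
    """One training image of a subject (hypothetical structure)."""
    subject_id: str
    yaw_deg: float            # assumed convention: 0 = frontal, 180 = back
    image: Sequence[float]    # placeholder for pixel data


def toy_encoder(image: Sequence[float]) -> Embedding:
    """Stand-in for a CLIP image encoder; NOT the real model."""
    total = sum(image) or 1.0
    return [value / total for value in image]


def assign_shared_conditions(
    samples: Sequence[ViewSample],
    encode: Callable[[Sequence[float]], Embedding] = toy_encoder,
) -> List[Embedding]:
    """Give every view of a subject the embedding of its most frontal view."""
    # Pick the most frontal view (smallest |yaw|) for each subject.
    frontal: Dict[str, ViewSample] = {}
    for sample in samples:
        best = frontal.get(sample.subject_id)
        if best is None or abs(sample.yaw_deg) < abs(best.yaw_deg):
            frontal[sample.subject_id] = sample

    # Encode each frontal view once, then share it across all views.
    condition = {sid: encode(view.image) for sid, view in frontal.items()}
    return [condition[sample.subject_id] for sample in samples]


samples = [
    ViewSample("a", 0.0, [1.0, 2.0, 3.0]),     # frontal view of subject a
    ViewSample("a", 90.0, [9.0, 9.0, 9.0]),    # side view of subject a
    ViewSample("a", 180.0, [5.0, 0.0, 5.0]),   # back view of subject a
    ViewSample("b", 10.0, [4.0, 1.0, 1.0]),    # near-frontal view of subject b
    ViewSample("b", -170.0, [2.0, 2.0, 2.0]),  # near-back view of subject b
]
conditions = assign_shared_conditions(samples)
```

Because the condition depends only on the subject and not on the camera angle, the side and back views of a subject carry the same conditioning vector as its frontal view, which is the property the abstract relies on to remove directional bias.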

Paper Structure (44 sections, 2 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: (a) No conditioning leads to early mode collapse and unstable training. (b) Disabling view conditioning mid-training causes rapid collapse within 1000 kimg. (c) Semantic conditioning enables faster and more effective training. (d–i) View-conditioned models show strong directional bias and global incoherence: conditional views are realistic, whereas non-conditional views are distorted and inconsistent. (d,e), (f,g), and (h,i) show results generated under random conditional views by PanoHead [an2023panohead], SphereHead [li2024spherehead], and HyPlaneHead [li2025hyplanehead], respectively.
  • Figure 2: Overview of our data generation pipeline. The first stage selects a large number of near-front-view head and facial images, generates their corresponding front-view images using FLUX.1 Kontext, and performs data preprocessing. The second stage again uses FLUX.1 Kontext, this time with different view-angle prompts, to extend the front-view images into multi-view collections. Finally, an image filtering agent based on Qwen2.5-VL removes images with artifacts or global incoherence.
  • Figure 3: Overview of our BalanceHead pipeline.
  • Figure 4: Qualitative comparison with state-of-the-art methods. Conditioned on the front view: (a) EG3D, (b) GGHead, (c) PanoHead, (d) SphereHead. (e) HyPlaneHead conditioned on the back view. (f) Our view-conditional baseline conditioned on the back view. (g) Our view-semantic-conditional baseline conditioned on the side view. (h–n) Our BalanceHead conditioned on the view-invariant semantic condition.
  • Figure 5: Single-view 3D-aware GAN Inversion.
  • ...and 11 more figures