Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis
Jingjing Ren, Cheng Xu, Haoyu Chen, Xinran Qin, Lei Zhu
TL;DR
Multi-modal conditioned face synthesis faces challenges of scalability and inflexible control when modalities have differing conditional entropy, denoted by $H$. The authors propose a diffusion-based framework with two key innovations: (i) uni-modal training with modal surrogates that decorate modality-specific conditions and enable inter-modal collaboration within a single diffusion U-Net, and (ii) an entropy-aware modal-adaptive modulation that dynamically adjusts diffusion noise per modality via a weighting module and multiple noise heads, yielding $\epsilon_\theta = \frac{1}{K} \sum_{k=1}^{K} ( w_k(n_k - n_b) + n_b )$. This approach supports flexible multi-modal synthesis using uni-modal annotations and achieves superior fidelity and alignment on Celeb-HQ across diverse condition combinations. The work contributes (1) modal surrogates for condition decoration and inter-modal linking, (2) entropy-aware modulation to adapt denoising to each modality’s information content, and (3) comprehensive ablations and comparisons showing improved performance over state-of-the-art methods. This framework offers scalable, high-quality multi-modal face synthesis with potential for broad applications and lighter training data requirements, along with pathways for integration into editing workflows.
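As a minimal sketch of the noise combination $\epsilon_\theta = \frac{1}{K} \sum_{k=1}^{K} ( w_k(n_k - n_b) + n_b )$ stated above, the NumPy snippet below evaluates the formula for given inputs. The interpretation of the symbols is an assumption here, not something the summary spells out: $n_b$ is taken to be a base noise prediction, $n_k$ the prediction of the $k$-th noise head, and $w_k$ a scalar weight produced by the weighting module. The function name `modulate_noise` and its argument names are hypothetical and illustrate only the arithmetic, not the authors' implementation.

```python
import numpy as np


def modulate_noise(base_noise, modal_noises, weights):
    """Evaluate eps = (1/K) * sum_k ( w_k * (n_k - n_b) + n_b ).

    base_noise   : array of shape S          (assumed meaning of n_b)
    modal_noises : K arrays of shape S       (assumed meaning of n_k)
    weights      : K scalars                 (assumed meaning of w_k)
    """
    base_noise = np.asarray(base_noise, dtype=float)
    modal_noises = np.asarray(modal_noises, dtype=float)  # shape (K, *S)
    weights = np.asarray(weights, dtype=float)            # shape (K,)

    if modal_noises.shape[0] != weights.shape[0]:
        raise ValueError("need exactly one weight per modality")

    # Reshape weights to (K, 1, ..., 1) so they broadcast over shape S.
    w = weights.reshape((-1,) + (1,) * (modal_noises.ndim - 1))

    # Per-modality term w_k * (n_k - n_b) + n_b, then average over the K modalities.
    per_modality = w * (modal_noises - base_noise) + base_noise
    return per_modality.mean(axis=0)


if __name__ == "__main__":
    n_b = np.full((2, 2), 0.5)
    n_1 = np.ones((2, 2))
    n_2 = np.full((2, 2), 3.0)
    print(modulate_noise(n_b, [n_1, n_2], [2.0, 0.5]))
```

Under this reading, a weight of zero makes a modality fall back to the base prediction, and a weight of one passes that modality's head output through unchanged, so each $w_k$ acts as a per-modality control strength.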
Abstract
Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet current methods still suffer from limited scalability, limited flexibility, and a one-size-fits-all approach to control strength that does not account for the differing levels of conditional entropy (a measure of unpredictability in data given some condition) across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support flexible, scalable, and adaptive multi-modal conditioned face synthesis. Our uni-modal training leverages only uni-modal data: modal surrogates decorate each condition with modal-specific characteristics and serve as linkers for inter-modal collaboration, so that the network fully learns each modality's control over the face synthesis process as well as the collaboration between modalities. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modal-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by thorough experimental results.
