Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis

Jingjing Ren; Cheng Xu; Haoyu Chen; Xinran Qin; Lei Zhu

TL;DR

Multi-modal conditioned face synthesis faces challenges of scalability and inflexible control when modalities have differing conditional entropy, denoted by $H$. The authors propose a diffusion-based framework with two key innovations: (i) uni-modal training with modal surrogates that decorate modality-specific conditions and enable inter-modal collaboration within a single diffusion U-Net, and (ii) an entropy-aware modal-adaptive modulation that dynamically adjusts diffusion noise per modality via a weighting module and multiple noise heads, yielding $\epsilon_\theta = \frac{1}{K} \sum_{k=1}^{K} ( w_k(n_k - n_b) + n_b )$. This approach supports flexible multi-modal synthesis using uni-modal annotations and achieves superior fidelity and alignment on Celeb-HQ across diverse condition combinations. The work contributes (1) modal surrogates for condition decoration and inter-modal linking, (2) entropy-aware modulation to adapt denoising to each modality’s information content, and (3) comprehensive ablations and comparisons showing improved performance over state-of-the-art methods. This framework offers scalable, high-quality multi-modal face synthesis with potential for broad applications and lighter training data requirements, along with pathways for integration into editing workflows.
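The noise combination in the formula above can be illustrated with a minimal NumPy sketch. It assumes that $n_k$ is the noise predicted by the head for modality $k$, that $n_b$ is a shared base noise prediction, and that each $w_k$ is a single scalar per modality; the summary does not confirm these readings, and the function name `combine_noise` and the toy values are hypothetical.

```python
import numpy as np

def combine_noise(modal_noises, base_noise, weights):
    """Compute eps = (1/K) * sum_k ( w_k * (n_k - n_b) + n_b ).

    modal_noises: array of shape (K, ...), one noise prediction per modality (assumed meaning of n_k)
    base_noise:   array of shape (...), shared base prediction (assumed meaning of n_b)
    weights:      array of shape (K,), one scalar weight per modality (assumption)
    """
    modal_noises = np.asarray(modal_noises, dtype=float)
    base_noise = np.asarray(base_noise, dtype=float)
    weights = np.asarray(weights, dtype=float)
    num_modalities = modal_noises.shape[0]
    # Reshape weights to (K, 1, ..., 1) so they broadcast over the noise dimensions.
    w = weights.reshape((num_modalities,) + (1,) * (modal_noises.ndim - 1))
    # Per-modality term: move from the base noise toward each modal noise by w_k.
    per_modality = w * (modal_noises - base_noise) + base_noise
    # Average over the K active modalities.
    return per_modality.mean(axis=0)

# Toy example with two modalities on a 2x2 "noise map" (values are made up).
n_b = np.zeros((2, 2))
n_text = np.ones((2, 2))
n_mask = 2.0 * np.ones((2, 2))
eps = combine_noise(np.stack([n_text, n_mask]), n_b, weights=[2.0, 0.5])
```

Under these assumptions, setting every weight to 0 returns the base noise and setting every weight to 1 returns the plain average of the modal noises, so each $w_k$ acts as a per-modality control strength.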

Abstract

Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet current methods still suffer from limited scalability, limited flexibility, and a one-size-fits-all control strength that does not account for the differing levels of conditional entropy (a measure of unpredictability in data given some condition) across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support a flexible, scalable, and adaptive multi-modal conditioned face synthesis network. Our uni-modal training with modal surrogates leverages only uni-modal data: the modal surrogates decorate each condition with modality-specific characteristics and serve as linkers for inter-modal collaboration, so the network fully learns each modality's control over the face synthesis process as well as inter-modal collaboration. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modality-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by our thorough experimental results.

Paper Structure (13 sections, 6 equations, 20 figures, 2 tables, 1 algorithm)


Introduction
Related Work
Latent Diffusion Model
Conditioned Face Synthesis
Method
Uni-modal Training with Modal Surrogates
Entropy-Aware Modal-Adaptive Modulation
Experiments
Experimental Setup
Face Synthesis with Flexible Modal Combinations
Comparison Analysis
Ablation Analysis
Conclusion

Figures (20)

Figure 1: Our method's versatile synthesis capabilities, demonstrating high-fidelity facial image generation from a flexible combination of modalities. Remarkably, these diverse face synthesis tasks are achieved within a single sampling process of a unified diffusion U-Net, demonstrating the method's efficiency and the seamless integration of multi-modal information.
Figure 2: Core idea comparison between existing multi-modal synthesis approaches and our method. (a) Fusing noises from multiple uni-modal diffusion models. (b) Incorporating additional control mechanisms into basic synthesis models for multi-modal conditioned synthesis. (c) Our method achieves multi-modal conditioned face synthesis within a single synthesis network, under flexible combinations of conditions, and dynamically adjusts the noise at each diffusion step.
Figure 3: Uni-modal synthesis results given different control strengths $w$. To generate facial images of high fidelity and quality, text and mask require different control strengths because of their entropy difference.
Figure 4: Results of multi-modal training. The input multi-modal conditions are shown on the left and the synthesis results on the right. The resulting network synthesizes pleasing results only when all conditions are given, and much of the guidance comes from the low-resolution image. The synthesis network tends to rely on the modality with low conditional entropy (LR) and thus neglects modalities with higher conditional entropy.
Figure 5: Uni-modal Training with Modal Surrogates
...and 15 more figures
