ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation

Liang Shi, Yun Fu

TL;DR

ExpertGen addresses fine-grained controllable text-to-face generation without task-specific training by using pre-trained face expert models as guidance signals. It pairs off-the-shelf expert models with a latent consistency model (LCM) to obtain reliable intermediate predictions, then back-propagates the experts' guidance through the diffusion process. The method supports single- and multi-expert control over identity, attributes, age, and segmentation maps, and uses text-guided warmup and gradient clipping to stabilize the training-free guidance. Empirical results on SD-v1.5 and SDXL show improvements over text-only and LDM-based guidance, indicating that training-free expert guidance is practical for configurable face synthesis.

Abstract

Recent advances in diffusion models have significantly improved text-to-face generation, but achieving fine-grained control over facial features remains a challenge. Existing methods often require training additional modules to handle specific controls such as identity, attributes, or age, making them inflexible and resource-intensive. We propose ExpertGen, a training-free framework that leverages pre-trained expert models such as face recognition, facial attribute recognition, and age estimation networks to guide generation with fine control. Our approach uses a latent consistency model to ensure realistic and in-distribution predictions at each diffusion step, enabling accurate guidance signals to effectively steer the diffusion process. We show qualitatively and quantitatively that expert models can guide the generation process with high precision, and multiple experts can collaborate to enable simultaneous control over diverse facial aspects. By allowing direct integration of off-the-shelf expert models, our method transforms any such model into a plug-and-play component for controllable face generation.
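
The guidance loop described above can be illustrated with a toy sketch. This is not the authors' implementation: every function name here (`predict_clean`, `expert_grad`, `toy_denoise_step`, `guided_sampling`) is hypothetical, the "latent" is a short list of floats, the one-step clean prediction is a fixed linear map standing in for an LCM, and each "expert" is a squared-error loss against a target. The sketch only shows the control flow the summary names: a text-only warmup, summing gradients from one or more experts, clipping the gradient norm, and nudging the latent.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

def predict_clean(latent):
    """Toy stand-in (assumption) for an LCM-style one-step clean prediction."""
    return [0.5 * v for v in latent]

def expert_grad(latent, target):
    """Gradient of ||predict_clean(latent) - target||^2 w.r.t. latent.

    Chain rule: d(pred)/d(latent) = 0.5 and d(loss)/d(pred) = 2 * (pred - target).
    """
    pred = predict_clean(latent)
    return [0.5 * 2.0 * (p - t) for p, t in zip(pred, target)]

def toy_denoise_step(latent, prior):
    """Toy stand-in (assumption) for one text-conditioned sampler step."""
    return [v + 0.2 * (p - v) for v, p in zip(latent, prior)]

def guided_sampling(latent, prior, targets, steps=8, warmup=2,
                    scale=0.5, max_norm=1.0):
    """Run the toy sampler, applying expert guidance after a text-only warmup."""
    latent = list(latent)
    for step in range(steps):
        latent = toy_denoise_step(latent, prior)
        if step < warmup:
            continue  # warmup: follow the text-conditioned sampler only
        # Multi-expert guidance: sum the gradients of all expert losses.
        total = [0.0] * len(latent)
        for target in targets:
            grad = expert_grad(latent, target)
            total = [a + b for a, b in zip(total, grad)]
        # Clip the combined gradient, then step against it.
        total = clip_by_norm(total, max_norm)
        latent = [v - scale * g for v, g in zip(latent, total)]
    return latent

start, prior, target = [4.0, -4.0], [0.0, 0.0], [1.0, 1.0]
unguided = predict_clean(guided_sampling(start, prior, targets=[]))
guided = predict_clean(guided_sampling(start, prior, targets=[target]))
print("unguided:", unguided)
print("guided:  ", guided)
```

In this toy setting the guided run ends closer to the expert's target than the unguided one. In a real pipeline the gradient would come from automatic differentiation through the expert network and the clean-image prediction, rather than from a hand-derived formula.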

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Evaluating the quality of intermediate predictions. (a) FID of face images at different DDIM steps (earliest steps omitted due to large values). While both models improve, LCM achieves higher image quality much faster. (b) Face recognition features of LCM’s intermediate predictions converge within early steps. (c) Low-quality early-step predictions lack distinctive facial features, and therefore form a multi-color cluster in feature space, resulting in ambiguous gradient guidance.
  • Figure 2: Qualitative and quantitative results of facial attribute guidance. (a) We select five attributes that are challenging to generate using text conditions alone, and demonstrate how ExpertGen effectively morphs the image across eight DDIM time steps to generate the correct attributes (zoom in for details). (b) Average probability of successful generation before and after ExpertGen across all 40 Celeb-A facial attributes. Most attributes gain substantial improvements with ExpertGen.
  • Figure 3: Visualizations of ID guidance, segmentation map guidance, and age guidance. The condition for each task is shown next to the generations: a reference image for ID guidance, a target segmentation map for segmentation guidance, and an age label for age guidance.
  • Figure 4: Visualizations of multi-expert guidance. Given an ID target on the left, we simultaneously apply ID and attribute or age guidance to generate images of the same person under new conditions.
  • Figure 5: Average probability of successful generation on SDXL before and after ExpertGen across all 40 Celeb-A facial attributes.