ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation
Liang Shi, Yun Fu
TL;DR
ExpertGen addresses fine-grained controllable text-to-face generation without task-specific training by using pre-trained face expert models as guidance signals. It pairs off-the-shelf experts with a latent consistency model, which supplies reliable intermediate predictions of the clean image so that the experts' gradients can be back-propagated to steer the diffusion process. The method supports single-expert and multi-expert control over identity, attributes, age, and segmentation maps, and uses a text-guided warmup and gradient clipping to stabilize training-free guidance. Experiments on SD-v1.5 and SDXL show improvements over text-only generation and LDM-based guidance, indicating that training-free expert guidance is a practical route to configurable face synthesis.
Abstract
Recent advances in diffusion models have significantly improved text-to-face generation, but achieving fine-grained control over facial features remains a challenge. Existing methods often require training additional modules to handle specific controls such as identity, attributes, or age, making them inflexible and resource-intensive. We propose ExpertGen, a training-free framework that leverages pre-trained expert models such as face recognition, facial attribute recognition, and age estimation networks to guide generation with fine control. Our approach uses a latent consistency model to ensure realistic and in-distribution predictions at each diffusion step, enabling accurate guidance signals to effectively steer the diffusion process. We show qualitatively and quantitatively that expert models can guide the generation process with high precision, and multiple experts can collaborate to enable simultaneous control over diverse facial aspects. By allowing direct integration of off-the-shelf expert models, our method transforms any such model into a plug-and-play component for controllable face generation.
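The guidance loop described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: every function, constant, and the two-dimensional "latent" below are hypothetical stand-ins chosen so the example runs with the Python standard library alone. `predict_clean` plays the role of the latent consistency model's one-step clean prediction, each "expert" is reduced to a squared-error loss against a target vector, and the gradient is derived by hand rather than by automatic differentiation. The sketch only shows the ordering of the pieces: denoise, skip guidance during a text-only warmup, sum the expert gradients, clip, and update the latent.

```python
import math

def predict_clean(z, t):
    """Toy stand-in for a one-step consistency-model prediction (assumption)."""
    scale = 1.0 - 0.5 * t  # hypothetical, t in (0, 1]
    return [scale * v for v in z], scale

def expert_grad(z, t, target):
    """Gradient of 0.5 * ||x0 - target||^2 w.r.t. z, through predict_clean."""
    x0, scale = predict_clean(z, t)
    return [scale * (a - b) for a, b in zip(x0, target)]

def clip_by_norm(g, max_norm):
    """Rescale g so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= max_norm:
        return g
    return [v * max_norm / norm for v in g]

def denoise_step(z, text_mean):
    """Toy denoiser: pull the latent toward a text-conditioned mean (assumption)."""
    return [v + 0.2 * (m - v) for v, m in zip(z, text_mean)]

def guided_sample(z, text_mean, targets, steps=20, warmup=5, lr=0.5, max_norm=1.0):
    """Denoise for `steps` iterations; apply expert guidance after the warmup."""
    for i in range(steps):
        t = 1.0 - i / steps
        z = denoise_step(z, text_mean)
        if i < warmup:
            continue  # text-only warmup: no expert guidance yet
        g = [0.0] * len(z)
        for target in targets:  # multiple experts: sum their gradients
            g = [a + b for a, b in zip(g, expert_grad(z, t, target))]
        g = clip_by_norm(g, max_norm)
        z = [v - lr * gv for v, gv in zip(z, g)]
    return z

if __name__ == "__main__":
    start, text_mean, target = [2.0, -2.0], [0.0, 0.0], [1.0, 1.0]
    print("unguided:", guided_sample(start, text_mean, []))
    print("guided:  ", guided_sample(start, text_mean, [target]))
```

In this toy setting the guided latent ends closer to the expert's target than the unguided one, while the clipping step bounds how far any single guidance update can move it. A real system would replace the stand-ins with a diffusion sampler, a latent consistency model, and differentiable expert networks.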
