Towards Localized Fine-Grained Control for Facial Expression Generation
Tuomas Varanka, Huai-Qian Khor, Yante Li, Mengting Wei, Hanwei Kung, Nicu Sebe, Guoying Zhao
TL;DR
FineFace introduces Action Units (AUs) as a localized, continuous conditioning signal for facial expression generation in diffusion models, enabling fine-grained control over muscle movements and complex expressions. The method uses an AU encoder and AU-Adapter, integrated via IP-Adapter with a Stable Diffusion backbone and LoRA fine-tuning to preserve base capabilities. A continuous multi-label AU representation, with each AU intensity in the range [0, 5], and distribution smoothing address multi-AU interactions and label noise, while a diverse high-resolution dataset combines AffectNet and DISFA with automatic annotations. Empirical results show improved AU adherence and prompt consistency, and the approach supports integration with image prompts for richer control. This work paves the way for nuanced, authentic facial expressions in generated content, with practical implications for media, avatars, and synthetic data generation.
Abstract
Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. Other basic expressions such as anger are possible, but are limited to the stereotypical expression, while unconventional facial expressions such as doubt are difficult to generate reliably. In this work, we propose the use of action units (AUs) for facial expression control in face generation. AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of facial movements. By combining different action units, we unlock the ability to create unconventional facial expressions that go beyond typical emotional models, enabling nuanced and authentic reactions reflective of real-world expressions. The proposed method can be seamlessly integrated with both text and image prompts using adapters, offering precise and intuitive control of the generated results. Code and dataset are available at https://github.com/tvaranka/fineface.
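To make the idea of combining action units at chosen intensities concrete, the following is a minimal sketch of how a continuous multi-label AU conditioning vector might be assembled. It is not the authors' implementation: the AU list (the 12 AUs annotated in DISFA), the helper name `build_au_vector`, and the example intensity values are all assumptions made for illustration, and the released code may use a different AU set or ordering.

```python
# Illustrative sketch only; names and values below are assumptions,
# not taken from the FineFace code base.

# Assumption: the 12 AUs annotated in DISFA, in a fixed order.
AU_IDS = [1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26]
MAX_INTENSITY = 5.0  # AU intensities are continuous values in [0, 5]


def build_au_vector(intensities):
    """Map {au_id: intensity} to a fixed-order list of floats clipped to [0, 5].

    AUs that are not mentioned default to 0.0 (not activated).
    """
    unknown = set(intensities) - set(AU_IDS)
    if unknown:
        raise ValueError(f"Unsupported AUs: {sorted(unknown)}")
    return [
        min(max(float(intensities.get(au, 0.0)), 0.0), MAX_INTENSITY)
        for au in AU_IDS
    ]


# Hypothetical combination: brow lowerer (AU4), outer brow raiser (AU2)
# and chin raiser (AU17), each with its own intensity.
example = build_au_vector({4: 2.5, 2: 1.0, 17: 3.0})
```

A vector of this form is the kind of input an AU encoder could consume. Because every AU has its own continuous entry, several AUs can be active at once and each intensity can be adjusted independently.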
