
Towards Localized Fine-Grained Control for Facial Expression Generation

Tuomas Varanka, Huai-Qian Khor, Yante Li, Mengting Wei, Hanwei Kung, Nicu Sebe, Guoying Zhao

TL;DR

FineFace introduces Action Units as a localized, continuous conditioning signal for facial expression generation in diffusion models, enabling fine-grained control over muscle movements and complex expressions. The method uses an AU encoder and AU-Adapter, integrated via IP-Adapter with a Stable Diffusion backbone and LoRA fine-tuning to preserve base capabilities. A continuous multi-label AU representation ([0,5]) and distribution smoothing address multi-AU interactions and label noise, while a diverse high-resolution dataset combines AffectNet and DISFA with automatic annotations. Empirical results show improved AU adherence and prompt consistency, and the approach supports integration with image prompts for richer control. This work paves the way for nuanced, authentic facial expressions in generated content, with practical implications for media, avatars, and synthetic data generation.
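The continuous multi-label AU representation mentioned above can be pictured as a fixed-length vector with one intensity per action unit. The following is a minimal sketch, not the authors' code: the AU list (the 12 AUs commonly annotated in DISFA), the ordering, and the helper name `build_au_condition` are assumptions made for illustration. Only the [0, 5] intensity range comes from the text.

```python
# Assumed AU set and ordering (the 12 AUs annotated in DISFA);
# the ordering FineFace actually uses may differ.
AU_IDS = (1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26)
MAX_INTENSITY = 5.0  # intensities live in [0, 5], as stated in the summary


def build_au_condition(intensities: dict[int, float]) -> list[float]:
    """Hypothetical helper: map {AU id: intensity} to a dense condition vector."""
    unknown = set(intensities) - set(AU_IDS)
    if unknown:
        raise ValueError(f"unsupported AUs: {sorted(unknown)}")
    vector = []
    for au in AU_IDS:
        value = float(intensities.get(au, 0.0))  # inactive AUs default to 0
        # Clamp to the valid continuous range instead of rounding to a class.
        vector.append(min(max(value, 0.0), MAX_INTENSITY))
    return vector


# Several AUs can be active at once (multi-label), each with its own intensity:
# AU6 (cheek raiser) plus AU12 (lip corner puller).
smile = build_au_condition({6: 2.5, 12: 4.0})
```

Keeping the values continuous rather than one-hot lets intermediate intensities such as 2.5 be expressed directly, and combinations of AUs are just vectors with several non-zero entries.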

Abstract

Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. Other basic expressions like anger are possible, but are limited to the stereotypical expression, while unconventional facial expressions such as a doubtful look are difficult to generate reliably. In this work, we propose the use of AUs (action units) for facial expression control in face generation. AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of facial movements. By combining different action units, we unlock the ability to create unconventional facial expressions that go beyond typical emotional models, enabling nuanced and authentic reactions reflective of real-world expressions. The proposed method can be seamlessly integrated with both text and image prompts using adapters, offering precise and intuitive control of the generated results. Code and dataset are available at https://github.com/tvaranka/fineface.

Paper Structure (35 sections, 5 equations, 19 figures, 3 tables)

This paper contains 35 sections, 5 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: The proposed method, FineFace, enables precise control over individual muscle movements of the face. By combining several Action Units (AUs), FineFace can generate complex and nuanced facial expressions. Our adapter architecture-based approach enables integration with image prompts using IP-Adapter [ipadapter].
  • Figure 2: Display of a selection of different action units and the intensity scale. Figure repurposed from [meb]. For a complete collection of AUs with videos see [facs_cheat_sheet].
  • Figure 3: FineFace generates an image based on a text prompt and an AU condition. The AU condition vector is first passed to an AU encoder and subsequently to the AU-Adapter. The output of the AU attention is then added with the existing text attention. In this setup, only the AU encoder and the K and V projection matrices are trainable, while the other layers remain frozen.
  • Figure 4: Comparison of different methods on 12 individual AUs with the prompt "A close-up of Barack Obama". See the figure labeled fig:aus for the textual descriptions of AUs.
  • Figure 5: Comparison of methods on combination AUs with the prompt "An Asian woman in the park".
  • ...and 14 more figures
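
The data flow described in the Figure 3 caption (AU vector, then AU encoder, then AU-Adapter, with the AU attention output added to the text attention output) can be sketched in NumPy as below. This is an illustrative toy under stated assumptions, not the FineFace implementation: the single-head attention, the linear AU encoder without bias, all dimensions, the random weights, and the `au_scale` parameter are hypothetical. Only the split between frozen weights and trainable weights (AU encoder plus the AU K and V projections) follows the caption.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
d, n_query, n_text, n_au_tokens, n_aus = 8, 16, 5, 4, 12  # toy sizes (assumed)

# Frozen in this setup: query projection and the text K/V projections.
W_q = rng.normal(scale=0.1, size=(d, d))
W_k_text = rng.normal(scale=0.1, size=(d, d))
W_v_text = rng.normal(scale=0.1, size=(d, d))

# Trainable in this setup: AU encoder and the AU-specific K/V projections.
W_au_enc = rng.normal(scale=0.1, size=(n_aus, n_au_tokens * d))  # assumed linear encoder
W_k_au = rng.normal(scale=0.1, size=(d, d))
W_v_au = rng.normal(scale=0.1, size=(d, d))


def cross_attention_with_aus(latents, text_tokens, au_vector, au_scale=1.0):
    q = latents @ W_q  # the same queries serve both attention branches
    text_out = attention(q, text_tokens @ W_k_text, text_tokens @ W_v_text)
    # AU encoder: turn the AU intensity vector into a few conditioning tokens.
    au_tokens = (au_vector @ W_au_enc).reshape(n_au_tokens, d)
    au_out = attention(q, au_tokens @ W_k_au, au_tokens @ W_v_au)
    # The AU attention output is added to the existing text attention output.
    return text_out + au_scale * au_out


latents = rng.normal(size=(n_query, d))
text_tokens = rng.normal(size=(n_text, d))
au_vector = np.zeros(n_aus)
au_vector[6] = 4.0  # one active AU at intensity 4 (index choice is arbitrary here)

out = cross_attention_with_aus(latents, text_tokens, au_vector)
text_only = cross_attention_with_aus(latents, text_tokens, au_vector, au_scale=0.0)
```

Because the AU branch is purely additive, setting `au_scale` to zero recovers the text-only output, which is one way to see why freezing the base layers leaves the original text-to-image behavior intact.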