A Unified and Interpretable Emotion Representation and Expression Generation

Reni Paskaleva, Mykyta Holubakha, Andela Ilic, Saman Motamed, Luc Van Gool, Danda Paudel

TL;DR

The paper introduces C2A2, a unified and interpretable emotion representation that merges Canonical, Compound, AUs, and AV into a 3D space with a learned third axis $Z$. It presents implicit supervision for $Z$ via GANmut and maps images back to the space using $\hat{Z}$, while enabling continuous, fine-grained control through a number encoder integrated with text-to-image diffusion (DreamBooth/Stable Diffusion) for text+number conditioning. Empirical results on AffectNet show that the 3D C2A2 representation covers more compound emotions (15/17) than the 2D AV space and yields superior quantitative (FED, ERE, SS) and qualitative outcomes, corroborated by expert human evaluations. The approach offers a practical pathway to richer, controllable emotion generation in visual media and lays groundwork for extensions to temporal dynamics and ethical deployment.

Abstract

Canonical emotions, such as happy, sad, and fearful, are easy to understand and annotate. However, emotions are often compound, e.g. happily surprised, and can be mapped to the action units (AUs) used for expressing emotions, and trivially to the canonical ones. Intuitively, emotions are continuous, as represented by the arousal-valence (AV) model. An interpretable unification of these four modalities (namely Canonical, Compound, AUs, and AV) is highly desirable for a better representation and understanding of emotions. However, such a unification remains unknown in the current literature. In this work, we propose an interpretable and unified emotion model, referred to as C2A2. We also develop a method that leverages labels of the non-unified models to annotate the novel unified one. Finally, we modify text-conditional diffusion models to understand continuous numbers, which are then used to generate continuous expressions using our unified emotion model. Through quantitative and qualitative experiments, we show that our generated images are rich and capture subtle expressions. Our work allows fine-grained generation of expressions in conjunction with other textual inputs and, at the same time, offers a new label space for emotions.

Paper Structure (10 sections, 2 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: We show the capability of our continuous 3D-representation-based expression generation method in generating rich and compound expressions. An extra, arbitrarily chosen expression component (+X) is added to the targeted compound on the left. The proposed 3D model outperforms the 2D model and other competing methods. Our model shares the same settings as DreamBooth.
  • Figure 2: The compound emotion model (left) unifies the categorical emotions and the AU-based expressions [du2014compound]. The continuous arousal-valence emotion model (middle) allows some of the categorical emotions to be mapped onto the continuous space [PADrusselmehrabis]. The proposed 3D emotion model (right) largely unifies both, thereby allowing more combinations of compound emotions.
  • Figure 3: We use a number encoder that embeds the continuous 3D representations of emotions. The embedded numbers are fused with the text embedding before decoding, giving number+text-to-image generation. Learning uses the frozen text encoder and a shared image decoder. During learning, our method uses prior preservation and emotion reconstruction losses, similar to DreamBooth [dreambooth].
  • Figure 4: Top three rows: images sampled around a circle (AV angle shown on top) at different learned $Z$ values of our 3D model. Bottom: the same circle for the 2D model. Note that the 3D model is clearly superior to the 2D model. The images in the first and last rows may be compared directly.
  • Figure 5: Both our 2D and 3D methods understand emotions represented as continuous numbers. For the 3D model, we showcase the behaviour with respect to the learned $Z$. These images illustrate that our learned representation is indeed continuous. Best viewed zoomed in.
  • ...and 4 more figures
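
The Figure 4 caption describes sampling conditioning points around a circle in the AV plane at several values of the learned $Z$. A minimal sketch of how such a grid of 3D coordinates could be built is given below. It is an illustration only, not the authors' code: the function name `sample_av_circle`, the unit radius, the placement of valence on the horizontal axis and arousal on the vertical axis, and the example $Z$ levels are all assumptions made for this sketch.

```python
import math


def sample_av_circle(n_angles, z_levels, radius=1.0):
    """Return (angle_deg, arousal, valence, z) tuples.

    Hypothetical helper: for every z level, place n_angles evenly spaced
    points on a circle of the given radius in the arousal-valence plane.
    """
    points = []
    for z in z_levels:
        for k in range(n_angles):
            angle_deg = 360.0 * k / n_angles
            theta = math.radians(angle_deg)
            valence = radius * math.cos(theta)  # assumed: valence is the horizontal axis
            arousal = radius * math.sin(theta)  # assumed: arousal is the vertical axis
            points.append((angle_deg, arousal, valence, z))
    return points


# Example: 8 angles at three illustrative Z levels (the values are assumptions).
grid = sample_av_circle(n_angles=8, z_levels=[-1.0, 0.0, 1.0])
for angle_deg, arousal, valence, z in grid[:3]:
    print(f"angle={angle_deg:6.1f}  A={arousal:+.3f}  V={valence:+.3f}  Z={z:+.1f}")
```

Each tuple would then be passed to a number encoder as the continuous conditioning signal alongside the text prompt. Fixing the AV angle and varying only $Z$ corresponds to comparing images in the same column across the top rows of the figure.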