Improved Emotional Alignment of AI and Humans: Human Ratings of Emotions Expressed by Stable Diffusion v1, DALL-E 2, and DALL-E 3

James Derek Lomas, Willem van der Maden, Sohhom Bandyopadhyay, Giovanni Lion, Nirmal Patel, Gyanesh Jain, Yanna Litowsky, Haian Xue, Pieter Desmet

TL;DR

The study addresses how well AI-generated emotional expressions align with human perception, a key concern for emotion-aware AI in wellbeing contexts. It introduces a human-rating benchmark in which three image generators (Stable Diffusion v1, DALL-E 2, DALL-E 3) produced 240 images from prompts covering ten emotions in person and robot contexts, and 24 participants rated how well the images aligned with their prompts on a 0–10 scale. The work contributes a general evaluation procedure, a dataset of over 5,700 ratings, and an analysis of model differences and context–emotion interactions: the newer DALL-E 3 improves alignment, but alignment still varies with context and some emotions remain challenging. These findings offer a scalable metric for tracking improvements in emotional alignment and inform the design of emotion-aware AI for mental-health applications, while also underscoring ethical risks such as manipulation and inauthenticity.
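The counts in this summary fit a fully crossed design. A minimal sketch of that arithmetic, assuming the 240 images are split evenly across 3 models × 2 contexts × 10 emotions and that every participant rated every image (neither assumption is stated explicitly here). The emotion labels other than "amusement" are hypothetical placeholders, and the prompt template follows the example prompts quoted in the abstract and in Figure 1.

```python
from itertools import product

# Factors named in the summary. Only "amusement" is named explicitly, so the
# other nine emotion labels are hypothetical placeholders.
MODELS = ["Stable Diffusion v1", "DALL-E 2", "DALL-E 3"]
CONTEXTS = ["person", "robot"]
EMOTIONS = ["amusement"] + [f"emotion_{i}" for i in range(2, 11)]  # 10 total

N_IMAGES = 240
N_PARTICIPANTS = 24


def prompt_for(context: str, emotion: str) -> str:
    # Template modeled on the example prompts quoted in the paper.
    return f"A {context} expressing the emotion {emotion}"


# Every model/context/emotion combination is one cell of the design.
cells = list(product(MODELS, CONTEXTS, EMOTIONS))

# Assumption: a balanced design with the same number of images per cell.
images_per_cell = N_IMAGES // len(cells)

# Assumption: each participant rated all 240 images.
total_ratings = N_PARTICIPANTS * N_IMAGES

print(len(cells), images_per_cell, total_ratings)
print(prompt_for("robot", "amusement"))
```

Under these assumptions each of the 60 model/context/emotion cells holds 4 images, and 24 × 240 = 5,760 ratings, which is consistent with the "over 5,700 ratings" figure.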

Abstract

Generative AI systems are increasingly capable of expressing emotions via text and imagery. Effective emotional expression will likely play a major role in the efficacy of AI systems, particularly those designed to support human mental health and wellbeing. This motivates our present research to better understand the alignment of AI-expressed emotions with human perception of emotions. When AI tries to express a particular emotion, how might we assess whether it is successful? To answer this question, we designed a survey to measure the alignment between emotions expressed by generative AI and human perceptions. Three generative image models (DALL-E 2, DALL-E 3, and Stable Diffusion v1) were used to generate 240 images, each based on a prompt designed to express one of five positive and five negative emotions, depicted by either a human or a robot. Twenty-four participants recruited through Prolific rated the alignment of the AI-generated emotional expressions with the text prompts used to generate them (e.g., "A robot expressing the emotion amusement"). The results of our evaluation suggest that generative AI models are indeed capable of producing emotional expressions that are well aligned with a range of human emotions; however, we show that alignment depends significantly on the AI model used and on the emotion itself. We analyze variations in the performance of these systems to identify gaps for future improvement. We conclude with a discussion of the implications for future AI systems designed to support mental health and wellbeing.
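The abstract reports that alignment depends on the AI model and on the emotion. A minimal sketch of how per-factor alignment scores could be summarized, assuming each rating is stored as a (model, context, emotion, score) record on the 0–10 scale and that a group's score is the plain mean of its ratings. The `Rating` record, the `mean_alignment` helper, and the toy values are hypothetical, and this is not necessarily the authors' analysis.

```python
from collections import defaultdict
from collections.abc import Callable, Hashable, Iterable
from statistics import mean
from typing import NamedTuple


class Rating(NamedTuple):
    model: str
    context: str  # "person" or "robot"
    emotion: str
    score: int  # alignment rating on a 0-10 scale


def mean_alignment(
    ratings: Iterable[Rating], key: Callable[[Rating], Hashable]
) -> dict:
    """Average alignment scores within each group defined by key(rating)."""
    groups = defaultdict(list)
    for r in ratings:
        # Reject values outside the 0-10 rating scale.
        if not 0 <= r.score <= 10:
            raise ValueError(f"score out of range: {r.score}")
        groups[key(r)].append(r.score)
    return {k: mean(v) for k, v in groups.items()}


# Toy values for illustration only; these are not data from the study.
ratings = [
    Rating("DALL-E 3", "person", "amusement", 9),
    Rating("DALL-E 3", "robot", "amusement", 7),
    Rating("DALL-E 2", "person", "amusement", 6),
    Rating("DALL-E 2", "robot", "amusement", 4),
]

by_model = mean_alignment(ratings, key=lambda r: r.model)
by_model_context = mean_alignment(ratings, key=lambda r: (r.model, r.context))
print(by_model)
```

Grouping by a single field gives main-effect means. Grouping by a tuple such as (model, context) gives the cells of a two-way interaction plot. Claims of significance would still require a proper statistical test, which this sketch does not perform.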

Paper Structure

This paper contains 23 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: "A person expressing the emotion amusement" by DALL-E 2 (top) and DALL-E 3 (bottom)
  • Figure 2: Example of an alignment survey question
  • Figure 3: Bar chart of the main effects. Error bars show ±1 standard error of the mean.
  • Figure 4: Bar chart of the two-way interaction between the AI Model and Context factors, showing the difference in performance when emotions are expressed by persons versus robots. Error bars show ±1 standard error of the mean.
  • Figure 5: Bar chart of the three-way interaction among the Context, AI Model, and Emotion factors. Error bars show ±1 standard error of the mean.
  • ...and 2 more figures
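Figures 3 to 5 draw error bars of 1 standard error around each mean. A minimal sketch of that computation, assuming the standard error is the sample standard deviation (n − 1 denominator) divided by √n over the ratings that make up one bar. The `mean_and_se` helper and the toy ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(scores):
    """Return (mean, standard error of the mean) for a list of ratings."""
    if len(scores) < 2:
        raise ValueError("need at least two ratings to estimate the SE")
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Toy ratings on the 0-10 scale; these are not data from the study.
m, se = mean_and_se([6, 8, 7, 9, 5])
lower, upper = m - se, m + se  # the span covered by a +/-1 SE error bar
print(f"{m:.2f} +/- {se:.2f}")
```

Because the standard error shrinks with √n, bars that pool more ratings (such as the main effects in Figure 3) will tend to be narrower than bars for individual cells of the three-way interaction in Figure 5.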