MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis

Yingjie Zhou, Zicheng Zhang, Jiezhang Cao, Jun Jia, Yanwei Jiang, Farong Wen, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai

TL;DR

MEMO-Bench tackles AI emotion analysis by jointly evaluating emotion generation in T2I models and emotion comprehension in MLLMs. It introduces 7,145 AI-generated portraits across six emotions produced by 12 T2I models, with subjective MOS-based annotations to support both quality and emotion labeling. A progressive, coarse-to-fine evaluation for MLLMs reveals that while categorization is feasible, precise emotion intensity understanding remains poor, and T2I models favor positive emotion generation over negative. The benchmark exposes gaps in AI emotional intelligence and provides a publicly available resource to guide development of more emotion-aware multimodal systems.

Abstract

Artificial Intelligence (AI) has demonstrated significant capabilities in various fields. In areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI's ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. MEMO-Bench will be made publicly available to support further research in this area.

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of how MEMO-Bench evaluates AI on human sentiment analysis. It covers the emotion generation ability of Text-to-Image models and the emotion comprehension ability of multimodal large language models.
  • Figure 2: Prompts for generating different emotions. Warm colors indicate positive emotions and cool colors indicate negative emotions.
  • Figure 3: Visualization of various T2I models' performance. Left: the number of AGPIs successfully generated by each T2I model for each sentiment prompt. Right: sample AGPIs generated by each T2I model, including both successful and unsuccessful cases.
  • Figure 4: Distribution of MOSs for all AGPIs.
  • Figure 5: Effect of different factors on the distribution of MOSs.
  • ...and 2 more figures