
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Qu Yang, Mang Ye, Bo Du

TL;DR

The paper addresses the gap in emotionally nuanced multimodal understanding by Multimodal Large Language Models (MLLMs). It introduces EmoBench, a large-scale emotional instruction-tuning benchmark, and EmoLLM, a model with Multi-perspective Visual Projection and EmoPrompt reasoning to improve emotion recognition and reasoning across diverse tasks. Empirical results show a 12.1% average improvement across foundation models on EmoBench, with EmoLLM outperforming several state-of-the-art MLLMs on emotion-related tasks while maintaining a smaller scale. The work contributes a benchmark, a specialized model architecture, and a methodology for guided multimodal emotional reasoning, with implications for HCI, mental health support, and empathetic AI applications, and commits to releasing code and data.

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. This limitation impedes their ability to effectively understand and react to the intricate emotions that humans express through multimodal media. To bridge this gap, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of 287k images and videos paired with corresponding textual instructions. Meanwhile, we propose EmoLLM, a novel model for multimodal emotional understanding that incorporates two core techniques: 1) Multi-perspective Visual Projection, which captures diverse emotional cues from visual data from multiple perspectives; and 2) EmoPrompt, which guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly elevates multimodal emotional understanding performance, with an average improvement of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for the development of artificial emotional intelligence capabilities with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and model will be released.

Paper Structure (17 sections, 8 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Qualitative (a) and quantitative (b) comparison of EmoLLM with GPT4-Vision and other SOTA MLLMs. EmoLLM outperforms other models, particularly in recognizing nuanced emotions such as anger and sadness. (c) Overview of the diverse tasks in EmoBench, including emotional universal tasks and emotional application tasks (hate, sarcasm, and humor detection).
  • Figure 2: Overview of the EmoBench benchmark and its applications. (a) EmoBench is built upon a diverse content database. (c) The process of creating EmoBench involves expert template definition, diverse template set generation, and instruction generation. (d) The proposed EmoLLM is designed to leverage EmoBench to improve multimodal emotional understanding capabilities.
  • Figure 3: Overview of the EmoLLM framework. (a) EmoLLM takes a user query and multimodal data as input, which are processed by an LLM and modality-specific encoders, respectively. (b) The Multi-perspective Visual Projection consists of several stages, each extracting features from visual tokens and building a graph connecting cluster centers. The combined representations form a comprehensive understanding of the emotional aspects.
  • Figure 4: Illustration of EmoPrompt. We utilize visual data and label pairs in EmoBench, and prompt GPT-4V to generate logical chains.
  • Figure 5: Hyperparameter ablation in Multi-perspective Visual Projection and EmoPrompts. EmoLLM performs best when $\tau$ is 0.1. For EmoPrompts, diversified prompts can enhance the emotional reasoning ability of the LLM.
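
The Figure 3(b) caption describes Multi-perspective Visual Projection as several stages that each extract features from visual tokens and build a graph connecting cluster centers. The following is a minimal sketch of that idea, not the authors' implementation. It assumes plain k-means for the clustering, a cosine-similarity graph, one round of message passing, and mean pooling per stage. All function names and the `cluster_sizes` parameter are hypothetical, and treating $\tau$ as a softmax temperature is also an assumption: the source only reports that performance is best at $\tau = 0.1$.

```python
import numpy as np


def cluster_tokens(tokens, num_clusters, num_iters=10, seed=0):
    """Plain k-means over visual tokens (assumed clustering); returns centers."""
    rng = np.random.default_rng(seed)
    # Initialise centers from randomly chosen tokens (fancy indexing copies).
    centers = tokens[rng.choice(len(tokens), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every center.
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers


def center_graph(centers, tau=0.1):
    """Row-normalised adjacency from cosine similarity, softmax with temperature tau."""
    normed = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    logits = normed @ normed.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def multi_perspective_features(tokens, cluster_sizes=(8, 4), tau=0.1):
    """Concatenate one pooled, graph-smoothed feature vector per stage."""
    stage_features = []
    for num_clusters in cluster_sizes:  # one "perspective" per stage (assumed)
        centers = cluster_tokens(tokens, num_clusters)
        adjacency = center_graph(centers, tau)
        propagated = adjacency @ centers  # one round of message passing
        stage_features.append(propagated.mean(axis=0))
    return np.concatenate(stage_features)


if __name__ == "__main__":
    # Stand-in for encoder output: 64 visual tokens of dimension 16.
    visual_tokens = np.random.default_rng(1).normal(size=(64, 16))
    features = multi_perspective_features(visual_tokens)
    print(features.shape)
```

With two stages and 16-dimensional tokens, the combined representation has 32 dimensions. In this sketch, a smaller `tau` sharpens the graph so that each cluster center attends mostly to its nearest neighbours.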