BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation

Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Yulin Wu, Bingquan Liu

TL;DR

BeMERC tackles multimodal emotion recognition in conversation by embedding video-derived speaker behaviors—facial micro-expressions, body language, and posture—into an MLLM-based MERC framework. It introduces a three-stage pipeline: video-derived behavior generation via Qwen2-VL, behavior alignment tuning with pseudo-labels, and MERC instruction tuning that fuses behavior signals with multimodal prompts for end-to-end training. The approach yields state-of-the-art accuracy and weighted-F1 on IEMOCAP and MELD, with ablations showing facial expressions contribute the largest gains and all three cues provide complementary benefits. This work highlights the practical importance of video-driven emotional dynamics in MERC and offers a scalable strategy to leverage video signals through instruction-tuned LLMs.

Abstract

Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics but overlook video-derived behavior information. Unlike text and audio inputs, videos rich in facial expressions, body language, and posture supply emotion trigger signals that help models predict emotions more accurately. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) that incorporates speakers' behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to conversational scenarios for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and we also provide a detailed discussion of the significance of video-derived behavior information in MERC.
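To make the idea of feeding behavior descriptions into an instruction prompt concrete, the following is a minimal sketch, not the authors' implementation. It assumes each utterance carries a free-text behavior description (for example, one produced by a vision-language model such as Qwen2-VL). The `Utterance` class, the `build_merc_prompt` function, and the prompt layout are all hypothetical names and choices made for illustration; the paper's actual prompt template is not given in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str
    text: str
    # Hypothetical field: a textual description of facial expression,
    # body language, and posture derived from the video clip.
    behavior: str = ""


def build_merc_prompt(history: List[Utterance], target: Utterance,
                      labels: List[str]) -> str:
    """Assemble an illustrative instruction prompt for one target utterance.

    The layout is an assumption for demonstration, not the paper's template.
    """
    lines = ["Conversation context:"]
    for utt in history:
        lines.append(f"{utt.speaker}: {utt.text}")
        if utt.behavior:
            lines.append(f"  [behavior] {utt.behavior}")
    lines.append(f"Target utterance: {target.speaker}: {target.text}")
    if target.behavior:
        lines.append(f"Target behavior: {target.behavior}")
    lines.append("Choose one emotion label from: " + ", ".join(labels))
    return "\n".join(lines)


if __name__ == "__main__":
    history = [Utterance("A", "You forgot again?", "frowning, arms crossed")]
    target = Utterance("B", "I'm sorry.", "eyes lowered, shoulders slumped")
    print(build_merc_prompt(history, target, ["neutral", "sad", "angry"]))
```

In a sketch like this, the behavior text is simply interleaved with the dialogue history, so the same prompt builder could be reused with or without behavior descriptions when comparing a text-only baseline against a behavior-aware variant.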

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison between video-derived behavior information and other descriptions in the MERC task. BeMERC integrates facial expression, posture, and body language into the MLLM to predict an accurate emotion for the given utterance in a conversation.
  • Figure 2: Overview of BeMERC, which comprises video-derived behavior generation, video-derived behavior alignment tuning, and MERC instruction tuning. In the video-derived behavior alignment tuning stage, the generated behaviors are used to enhance the LLM's perception of emotional dynamics. The MERC instruction tuning stage improves the LLM's ability to perform MERC tasks.
  • Figure 3: Predicted number of labels for each category on two datasets. 'Base' refers to 'LLMERC'. 'All' refers to 'BeMERC'.
  • Figure 4: Performance (F1) of each component of BeMERC across different emotion categories on two datasets.
  • Figure 5: Visualization of the learned embeddings.
  • ...and 3 more figures