Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

Han Zhang, Yu Lu, Liyun Zhang, Dian Ding, Dinghua Zhao, Yi-Chao Chen, Ye Wu, Guangtao Xue

TL;DR

Lantern tackles multi-modal emotion recognition by combining two components: a lightweight multi-task vanilla model that predicts emotion class probabilities and dimension scores from multimedia inputs, and a frozen large language model that is prompted with those predictions. A sliding-window strategy provides multiple receptive fields, and a receptive-field-aware attention merge combines predictions across fields, so that the LLM's external knowledge and contextual understanding can refine the final decisions. The approach yields consistent gains on IEMOCAP in both the 4-way and 6-way settings, with improvements of up to approximately 1.8 percentage points in accuracy when pairing CORECT with GPT-4 or Llama-3.1-405B, while remaining resource-efficient (a single GPU for the vanilla model and cloud-based LLMs). It also shows the value of dimension scores as complementary signals for emotion recognition. Overall, Lantern offers a practical way to leverage powerful LLMs for multimodal emotion understanding without the prohibitive cost of multimodal LLMs, broadening the applicability of external-knowledge-enhanced inference in dialogue reasoning.

Abstract

Understanding the emotions in a dialogue usually requires external knowledge to interpret its content accurately. As LLMs become more and more powerful, we do not want to settle for the limited ability of a pre-trained language model. However, LLMs either can process only the text modality or are too expensive to process multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that improves the performance of a given vanilla model by prompting large language models with receptive-field-aware attention weighting. The framework trains a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLM as references, and the LLM adjusts the predicted probability of each emotion class using its external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly $t$ receptive fields. Finally, the predictions of the LLM are merged by a receptive-field-aware attention-driven weighting module. In the experiments, the vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. Experiments on IEMOCAP with 4-way and 6-way settings demonstrate that Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.
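
The final merging step can be illustrated with a minimal sketch. The abstract does not give the formula of the attention-driven weighting module, so the code below only assumes that each candidate prediction (one from the vanilla model plus those from the LLM prompts) receives a scalar score, that the scores are normalized with a softmax, and that the final prediction is the weighted sum of the class-probability vectors. The function names `softmax` and `merge_predictions` and the `scores` input are hypothetical and stand in for whatever the actual module computes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_predictions(predictions, scores):
    """Combine several class-probability vectors into one.

    predictions: list of probability vectors, one per candidate prediction
                 (assumed: 1 vanilla prediction + t LLM predictions).
    scores:      one unnormalized relevance score per prediction
                 (hypothetical stand-in for the attention module's output).
    Returns (merged_probabilities, normalized_weights).
    """
    if not predictions or len(predictions) != len(scores):
        raise ValueError("need one score per prediction")
    weights = softmax(scores)
    n_classes = len(predictions[0])
    merged = [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(n_classes)
    ]
    return merged, weights


if __name__ == "__main__":
    # Three candidate predictions over four emotion classes (made-up numbers).
    preds = [
        [0.70, 0.10, 0.10, 0.10],
        [0.40, 0.40, 0.10, 0.10],
        [0.25, 0.25, 0.25, 0.25],
    ]
    merged, weights = merge_predictions(preds, scores=[2.0, 1.0, 0.0])
    print(merged, weights)
```

Because the weights sum to one and every input is a probability vector, the merged output is again a probability vector, so the predicted class can be read off with an argmax.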

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Integrating external knowledge: The first approach uses a pre-trained language model to generate a new modality from text with external knowledge. The second approach leverages an LLM to process dialogue transcriptions. Our framework uses a vanilla model to process multimedia modalities and generate intermediate supportive information for LLM refinement.
  • Figure 2: Our framework Lantern has three stages: first, a vanilla model pre-processes all modalities to produce a preliminary prediction of the probabilities and the dimension scores for each sample. Second, prompts with the preliminary predictions and the transcriptions are fed into a frozen LLM for further adjustment. Each sample will be included in $t$ prompts. Finally, a receptive-field-aware attention algorithm is implemented to assign weights for the $t+1$ predictions to form the final prediction.
  • Figure 3: Methods to predict metrics: Figure 3a shows a single-task model, where a dedicated backbone is used to extract features specific to one task. Figure 3b shows a multi-task pattern, where the backbone extracts features suitable for both metrics and both predictions are based on the same features.
  • Figure 4: Strategies to split a dialogue: Figure 4a shows naive splitting, where the receptive fields do not overlap. Figure 4b pads each receptive field with some samples at its beginning and end. The $\times t$ and $t\times$ labels mean that the receptive field is repeated $t$ times. Figure 4c shows the sliding-window strategy for $t=3$, which provides different receptive fields for each sample and thus captures dialogue features from different perspectives.
  • Figure 5: LDA coefficients
  • ...and 1 more figures
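
The sliding-window splitting described for Figure 4c can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a fixed stride, a window length of `stride * t`, and windows that are clipped at the dialogue boundaries, which is one simple way to guarantee that every sample falls into exactly $t$ receptive fields. The function name `sliding_receptive_fields` and the `stride` parameter are hypothetical.

```python
from collections import Counter


def sliding_receptive_fields(n, stride, t):
    """Split a dialogue of n utterances into overlapping index ranges.

    Assumption: window length is stride * t, and windows start at every
    multiple of `stride` from -(t - 1) * stride onward, clipped to [0, n).
    With this layout each utterance index lies in exactly t windows.
    """
    if n <= 0 or stride <= 0 or t <= 0:
        raise ValueError("n, stride and t must be positive")
    window = stride * t
    fields = []
    start = -(t - 1) * stride
    while start < n:
        lo = max(start, 0)           # clip at the start of the dialogue
        hi = min(start + window, n)  # clip at the end of the dialogue
        fields.append(range(lo, hi))
        start += stride
    return fields


if __name__ == "__main__":
    fields = sliding_receptive_fields(n=7, stride=2, t=3)
    coverage = Counter(i for field in fields for i in field)
    for field in fields:
        print(list(field))
    print(dict(coverage))
```

Clipping makes the first and last windows shorter than the interior ones, so the receptive fields near the dialogue boundaries carry less context. Whether the original method handles the boundaries this way is not stated in the text above.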