Reinforcing Trustworthiness in Multimodal Emotional Support Systems

Huy M. Le, Dat Tien Nguyen, Ngan T. T. Vo, Tuan D. Q. Nguyen, Nguyen Binh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Lizi Liao, Binh T. Nguyen

TL;DR

MultiMood addresses the challenge of trustworthy, empathetic emotional support by fusing video, audio, and text to infer user states and generate therapist-like responses. The architecture employs modality-specific encoders, cross-modality fusion, and a memory-efficient ConvCompressor, and is trained in two stages: supervised fine-tuning and reinforcement learning aligned with seven trustworthiness dimensions. GRPO and PPO optimization, guided by GPT-4o and ColBERT-based similarity, balance truthfulness, safety, privacy, fairness, empathy, reliability, and ethical guidance with generation quality. Empirical results on the MESC and DFEW datasets show state-of-the-art emotion recognition and response generation, substantial history-token reduction, and improved human/LLM trust assessments, signaling practical potential for AI-assisted emotional support.
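The reinforcement learning stage summarized above can be illustrated with a small sketch of the group-relative advantage step that GRPO-style training relies on. This is not the authors' implementation: the blending rule in `combined_reward`, the `trust_weight` parameter, and the example scores are assumptions made for illustration, standing in for the GPT-4o trust ratings and ColBERT-based similarity mentioned in the summary.

```python
from statistics import mean, pstdev

# The seven trustworthiness dimensions named in the summary.
TRUST_DIMENSIONS = (
    "truthfulness", "safety", "privacy", "fairness",
    "empathy", "reliability", "ethical_guidance",
)


def combined_reward(trust_scores, similarity, trust_weight=0.5):
    """Blend the mean trust score with a similarity score (both assumed in [0, 1]).

    The linear blend and the default weight are hypothetical choices.
    """
    missing = [d for d in TRUST_DIMENSIONS if d not in trust_scores]
    if missing:
        raise ValueError(f"missing trust dimensions: {missing}")
    trust = mean(trust_scores[d] for d in TRUST_DIMENSIONS)
    return trust_weight * trust + (1.0 - trust_weight) * similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two hypothetical candidate responses to the same user turn.
    strong = {d: 0.9 for d in TRUST_DIMENSIONS}
    weak = {d: 0.4 for d in TRUST_DIMENSIONS}
    rewards = [combined_reward(strong, 0.8), combined_reward(weak, 0.6)]
    print(group_relative_advantages(rewards))
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged; because the baseline comes from the group itself, no separate value network is needed for this step.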

Abstract

In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations: they often rely solely on text, convert other data types into text, or provide emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art results on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.

Paper Structure

This paper contains 44 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Example conversation illustrating the difference between prior systems and MultiMood. Prior methods respond with factual queries, whereas MultiMood demonstrates emotional awareness and offers empathetic, supportive feedback.
  • Figure 2: MultiMood overview. Multimodal architecture that processes video, audio, text, and historical conversation data through dedicated encoders. The modality-specific embeddings are fused and passed into an LLM, which is further optimized using reinforcement learning guided by trustworthiness criteria to generate emotionally supportive and responsible responses.
  • Figure 3: ConvCompressor architecture and pretraining.
  • Figure 4: Full prompt used for training and inference.
  • Figure 5: Our human evaluation process.
  • ...and 2 more figures