Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection

Maya Srikanth, Run Chen, Julia Hirschberg

TL;DR

This paper investigates why multimodal empathy detection fails when cues across text, audio, and video conflict, proposing disagreement as a diagnostic signal. It uses fine-tuned unimodal models and a gated fusion mechanism on the EmpSpeech dataset to analyze when and why modality disagreements occur, revealing annotator uncertainty and fusion biases toward certain cues. Key findings show that dominant cues in one modality can mislead fusion, and humans do not consistently benefit from multimodal input in these tasks, underscoring the value of disagreement-based analysis. The work offers a scalable framework for identifying ambiguous examples, informing annotation strategies, curriculum design, and adaptive fusion approaches to improve robustness in socially grounded AI systems.

Abstract

Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.

Paper Structure

This paper contains 26 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Given classifications provided by a single modality, we identify cases where integrating additional modalities leads to a different prediction. We analyze these differences to understand when and why they occur.
  • Figure 2: Comparing predictions between unimodal (text, audio, video) and multimodal models. We highlight regions where model predictions agree (blue and yellow quadrants) and disagree (red and green quadrants).
  • Figure 3: UMAP of text-only embeddings for empathetic (left) vs. neutral (right) examples, colored by modality disagreement; red and green points cluster near the decision boundary, marking ambiguous cases.
  • Figure 4: Annotation interface
  • Figure 5: Distribution of audio features for red, green and blue examples across the confidence quadrants. Red examples are those correctly classified by the unimodal audio model but misclassified by the multimodal model; green examples represent the reverse. Blue examples represent those correctly classified by both the unimodal audio model and the multimodal model. Significant differences appear in pitch and intensity-based features.
  • ...and 1 more figure
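
The color scheme in the Figure 2 and Figure 5 captions can be illustrated with a minimal sketch that assigns each example to a quadrant by comparing a unimodal prediction and a multimodal prediction against the gold label. The red, green, and blue definitions follow the Figure 5 caption. Treating yellow as "both models wrong" is an assumption inferred from Figure 2 listing yellow as an agreement quadrant, and it holds only for a binary label set such as empathetic vs. neutral. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def quadrant(label, unimodal_pred, multimodal_pred):
    """Assign one example to a color-coded quadrant.

    red    : unimodal correct, multimodal wrong (per the Figure 5 caption)
    green  : unimodal wrong, multimodal correct (per the Figure 5 caption)
    blue   : both correct (per the Figure 5 caption)
    yellow : both wrong (ASSUMPTION, inferred from Figure 2)
    """
    unimodal_correct = unimodal_pred == label
    multimodal_correct = multimodal_pred == label
    if unimodal_correct and multimodal_correct:
        return "blue"
    if unimodal_correct:
        return "red"
    if multimodal_correct:
        return "green"
    return "yellow"


def quadrant_counts(labels, unimodal_preds, multimodal_preds):
    """Count how many examples fall into each quadrant."""
    return Counter(
        quadrant(y, u, m)
        for y, u, m in zip(labels, unimodal_preds, multimodal_preds)
    )


# Toy data (hypothetical): 1 = empathetic, 0 = neutral.
labels = [1, 1, 0, 0, 1]
audio_preds = [1, 1, 1, 0, 0]
fused_preds = [1, 0, 0, 0, 0]

counts = quadrant_counts(labels, audio_preds, fused_preds)
print(dict(counts))
```

The red and green groups are the disagreement cases the paper analyzes; under these assumptions, examples in those groups could be set aside for closer inspection, such as comparing their feature distributions or annotator agreement.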