AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMs with Audio-visual Cues

Krish Patel, Dingkun Zhou, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

TL;DR

AV-EMO-Reasoning benchmarks emotional reasoning in omni-modal LLMs using synchronized audio-visual cues, addressing a gap in evaluating how models detect, interpret, and respond to user emotions across modalities and turns. The framework introduces an AV-CSER model for frame-level continuous emotion tracking and an AV-based categorical emotion detector, alongside cross-modal coherence metrics that combine continuous, categorical, and perceptual assessments. Experiments show visual cues improve emotional coherence over audio-only baselines, but models still struggle when prosody and facial cues conflict, indicating substantial headroom for improvement in cross-modal reasoning. By providing a reproducible benchmark and baselines on synthetic and real dialogues, AV-EMO-Reasoning advances towards more natural, adaptive, emotion-aware human-AI communication.

Abstract

Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models (LLMs), the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional coherence in LLMs. The framework combines a curated single- and multi-turn synthetic audiovisual corpus with a real-world set, and evaluates models under continuous, categorical, and perceptual metrics. Experiments with leading LLMs show that visual cues reliably improve emotional coherence over audio-only baselines. Moreover, LLMs can leverage audio-visual cues to generate more emotion-aware speech. Models exhibit complementary strengths across metric families, indicating that automatic scores capture facets distinct from perceptual judgments. By releasing a systematic evaluation benchmark, AV-EMO-Reasoning offers a reproducible standard for evaluating emotion-aware dialogue and advances toward more natural, adaptive human-AI interaction.

Paper Structure

This paper contains 15 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The AV-EMO-Reasoning Framework.
  • Figure 2: Continuous emotion predictions across modalities on the RECOLA dataset.