EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs

Wenjie Tian, Zhixian Zhao, Jingbin Hu, Huakang Chen, Haohe Liu, Binshen Mu, Lei Xie

TL;DR

This work presents EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue, and introduces the emotional Chain-of-Thought (E-CoT), which enforces a reasoning path from fine-grained multimodal perception to textual response.

Abstract

The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often producing superficial understanding and contextually mismatched emotional responses. The problem is compounded by the Thinker-Talker architecture of Omni-LLMs, in which the two modules are connected only implicitly through hidden states, so emotional details are lost. In this work, we present EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought (E-CoT), which enforces a reasoning path from fine-grained multimodal perception to textual response. Moreover, we explicitly treat the E-CoT as high-level emotional instructions that guide the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to obtain annotated real-world dialogue data and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking under the same talker.

Paper Structure

This paper contains 35 sections, 5 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: The overall framework of EmoOmni. The system mimics human affective cognition through a Perception-Reasoning-Expression causal chain.
  • Figure 2: Overview of the EmoOmniPipe. The data pipeline consists of three main stages: (1) processing of raw data, (2) multimodal annotation, and (3) construction of E-CoT and dialogue data.
  • Figure 3: Overview of EmoOmni. The left part illustrates the overall pipeline of our multimodal emotional dialogue system. Given audio–visual inputs, the Thinker module performs high-level reasoning and produces the emotional Chain-of-Thought, which consists of four components. The generated textual response is then fed into the talker module for expressive speech synthesis. The right part shows the architecture of the talker, which is built upon an autoregressive TTS model. The response strategy is further processed by a lightweight language model to generate controllable emotion instructions, enabling the talker to synthesize expressive speech.
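
The hand-off described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names, field names, and the `toy_instruct_lm` stand-in are all hypothetical. The caption says the E-CoT has four components but names only the response strategy and the textual response, so the others are kept in an untyped dictionary rather than guessed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ECoT:
    """Hypothetical container for the Thinker's output (names are assumptions)."""
    response_strategy: str   # how the reply should be delivered
    response_text: str       # what the reply says
    # The remaining E-CoT components are not named in the caption.
    other_components: Dict[str, str] = field(default_factory=dict)


@dataclass
class TalkerInput:
    """Hypothetical input to an instruction-controllable TTS talker."""
    text: str          # content to synthesize
    instruction: str   # emotion-control instruction


def build_talker_input(ecot: ECoT, instruct_lm: Callable[[str], str]) -> TalkerInput:
    """Pass the E-CoT to the talker as explicit text rather than hidden states."""
    instruction = instruct_lm(ecot.response_strategy)
    return TalkerInput(text=ecot.response_text, instruction=instruction)


def toy_instruct_lm(strategy: str) -> str:
    # Stand-in for the lightweight language model; a real one would generate text.
    return f"Speak in a way that follows this strategy: {strategy}"


ecot = ECoT(
    response_strategy="comfort the speaker and acknowledge their frustration",
    response_text="That sounds really exhausting. Do you want to talk about it?",
)
talker_input = build_talker_input(ecot, toy_instruct_lm)
print(talker_input.instruction)
```

The point of the sketch is the interface: the talker receives both the text and a readable instruction derived from the response strategy, so the emotional intent is carried explicitly instead of only through hidden states.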