Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen

TL;DR

Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels. It achieves state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

Abstract

The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The encoder outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels. It achieves state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

Paper Structure (41 sections, 7 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: This framework organizes tasks by increasing cognitive depth: (1) Perception for direct recognition of emotional cues; (2) Understanding for inferring emotional causality and context; and (3) Emotional Interaction for establishing an emotional connection with humans. Please refer to the appendix for details.
  • Figure 1: The facial encoder extracts multiscale facial features and fuses them via an MLP to generate a rich facial embedding $E_{f}$. Subsequently, a temporal modeling block models the sequence and outputs a final facial representation, which provides the language model with critical affective visual signals $E_{f}^c$. Fusion experts use audio features to guide vision and extract key complementary information $E_{mf}^i$.
  • Figure 2: The architecture of Nano-EmoX. The visual branch extracts general visual emotional cues, the facial branch models fine-grained facial details, and the speech branch captures acoustic emotional cues. To balance the contribution of each modality, the fusion branch integrates key emotional cues from the audio-visual modalities and extracts complementary information. The language model integrates multimodal information and performs multitask emotion recognition.
  • Figure 2: More visualization results on the ERI and ERG tasks.
  • Figure 3: The fusion encoder extracts multi-layer features from the visual and speech encoders and feeds them to three fusion experts with independent weights. Each expert extracts complementary information $E_{mf}^i$. Then, the gating network dynamically weighs the contribution $G_{i}$ of each expert and routes the expert outputs into the fused output feature $E_{mf}$.
  • ...and 3 more figures
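
The gated expert fusion described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the gate weights $G_{i}$ come from a softmax over a linear gating layer and that the fused feature is the weighted sum $E_{mf} = \sum_i G_{i} E_{mf}^i$. Neither detail is stated in the captions. Each expert is replaced by a plain linear map, whereas the captions indicate that the real experts use audio features to guide vision. The input is a single concatenated feature vector. All function names and dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def gated_fusion(x, expert_weights, gate_weights):
    """Fuse expert outputs with softmax gates (assumed form, see lead-in).

    x              : concatenated audio-visual feature vector (assumption)
    expert_weights : one weight matrix per expert, independent of each other
    gate_weights   : matrix with one row per expert, produces gate scores
    Returns (fused feature E_mf, gate values G_i).
    """
    # Stand-in for E_mf^i: each expert is just a linear map here.
    expert_outputs = [matvec(w, x) for w in expert_weights]
    # Stand-in for G_i: softmax over linear gate scores.
    gates = softmax(matvec(gate_weights, x))
    dim = len(expert_outputs[0])
    fused = [
        sum(g * out[d] for g, out in zip(gates, expert_outputs))
        for d in range(dim)
    ]
    return fused, gates


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weight matrix for the demo."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def demo():
    rng = random.Random(0)
    x = [0.5, -1.0, 2.0, 0.1]                                # toy input feature
    experts = [random_matrix(3, 4, rng) for _ in range(3)]   # three experts
    gate = random_matrix(3, 4, rng)                          # one score per expert
    fused, gates = gated_fusion(x, experts, gate)
    print("gates:", gates)
    print("fused:", fused)


if __name__ == "__main__":
    demo()
```

Because the gates are a softmax, they are positive and sum to one, so the fused feature is a convex combination of the expert outputs. With all-zero gate weights the three experts contribute equally.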