Bridging Speech Emotion Recognition and Personality: Dataset and Temporal Interaction Condition Network

Yuan Gao, Hao Shi, Yahui Fu, Chenhui Chu, Tatsuya Kawahara

TL;DR

This work investigates how personality information influences speech emotion recognition (SER) and introduces PA-IEMOCAP, a version of IEMOCAP annotated with Big Five personality traits. It reports significant correlations between personality and emotion expression, especially valence, and proposes a temporal interaction condition network (TICN) that fuses personality features with HuBERT acoustic representations, using cross-attention for dynamic fusion. Both ground-truth (GT) and predicted personality traits substantially improve SER, particularly valence recognition: the concordance correlation coefficient (CCC) reaches about 0.785 with GT traits and about 0.776 with traits predicted by conversation-level personality recognition (PR). The findings lay a foundation for personality-aware speech processing, including the use of predicted user traits in dialogue systems, and point future work toward more naturalistic data.
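
To make the idea of cross-attention fusion concrete, here is a minimal pure-Python sketch. It is not the authors' TICN: the summary does not give its layers, dimensions, or attention direction. The sketch assumes that acoustic frames act as queries over a small set of personality feature vectors, that learned projections are omitted, and that the attended context is added back as a residual. The names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of T vectors (here: acoustic frames)
    keys, values: lists of N vectors (here: personality features)
    Learned projection matrices are left out (an assumption for brevity).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)]
        )
    return attended


def fuse(acoustic, personality):
    """Add the personality context to each acoustic frame (residual fusion)."""
    context = cross_attention(acoustic, personality, personality)
    return [[a + c for a, c in zip(frame, ctx)] for frame, ctx in zip(acoustic, context)]


# Toy example: 3 acoustic frames and 2 personality vectors, all of dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
traits = [[0.5, 0.5], [0.2, 0.8]]
fused = fuse(frames, traits)
```

Because the attention weights are recomputed for every frame, the personality contribution can vary over time, which is the sense in which such fusion is "dynamic".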

Abstract

This study investigates the interaction between personality traits and emotion expression, exploring how personality information can improve speech emotion recognition (SER). We collect personality annotations for the IEMOCAP dataset, yielding PA-IEMOCAP, the first speech dataset that contains both emotion and personality annotations and thus enables direct integration of personality traits into SER. Statistical analysis of this dataset identifies significant correlations between personality traits and emotional expressions. To extract fine-grained personality features, we propose a temporal interaction condition network (TICN), in which personality features are integrated with HuBERT-based acoustic features for SER. Experiments show that incorporating ground-truth personality traits significantly enhances valence recognition, improving the concordance correlation coefficient (CCC) from 0.698 for the baseline without personality information to 0.785. For practical applications in dialogue systems, where personality information about the user is unavailable, we develop a front-end module for automatic personality recognition. Using these automatically predicted traits as inputs to our proposed TICN model, we achieve a CCC of 0.776 for valence recognition, an 11.17% relative improvement over the baseline. These findings confirm the effectiveness of personality-aware SER and provide a solid foundation for further exploration of personality-aware speech processing applications.
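
The results above are reported as CCC values and a relative improvement. As a reading aid, the sketch below computes the standard concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), and reproduces the relative-improvement arithmetic from the two reported scores. The function names are hypothetical, and population (biased) variances are an assumption, since the summary does not say which estimator the authors use.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (biased) variances and covariance: an assumption of this sketch.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Valence CCC values quoted in the abstract.
gain_predicted = relative_improvement(0.698, 0.776)  # predicted traits vs. baseline
gain_gt = relative_improvement(0.698, 0.785)         # ground-truth traits vs. baseline
```

Unlike Pearson correlation, CCC penalizes a shift in mean or scale: predictions that track the labels perfectly but are offset by a constant score below 1.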

Paper Structure

This paper contains 23 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overall flowchart of the proposed approach. We use all the utterances of a whole conversation to predict personality traits. Note that we conduct independent experiments to predict each personality trait. The predicted traits are then projected by the proposed temporal interaction condition network (TICN) to improve SER.
  • Figure : (a) Concat.
  • Figure : (a) Emotion recognition results
  • Figure : (a) Setting 1
  • Figure : (a) Concat.
  • ...and 7 more figures