Automatic Emotion Modelling in Written Stories

Lukas Christ, Shahin Amiriparian, Manuel Milling, Ilhan Aslan, Björn W. Schuller

TL;DR

The paper proposes a set of novel Transformer-based methods that predict valence and arousal signals over the course of written stories using a pretrained ELECTRA model, and it studies the benefits of considering a sentence's context when inferring its emotionality.

Abstract

Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modelling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no labelled benchmark for this task. We address this gap by introducing continuous valence and arousal annotations for an existing dataset of children's stories annotated with discrete emotion categories. We collect additional annotations for this data and map the originally categorical labels to the valence and arousal space. Leveraging recent advances in Natural Language Processing, we propose a set of novel Transformer-based methods for predicting valence and arousal signals over the course of written stories. We explore several strategies for fine-tuning a pretrained ELECTRA model and study the benefits of considering a sentence's context when inferring its emotionality. Moreover, we experiment with additional LSTM and Transformer layers. The best configuration achieves a Concordance Correlation Coefficient (CCC) of .7338 for valence and .6302 for arousal on the test set, demonstrating the suitability of our proposed approach. Our code and additional annotations are made available at https://github.com/lc0197/emotion_modelling_stories.
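
The Concordance Correlation Coefficient reported above can be illustrated with a minimal sketch. It uses the standard definition, $\mathrm{CCC} = \frac{2\,\mathrm{cov}(x, y)}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, with population (biased) variances. The function name `concordance_cc` and the toy inputs are assumptions made for illustration; whether the authors compute the score per story or over the concatenated test set is not stated here, so this should not be read as their evaluation code.

```python
import numpy as np


def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between two 1-D sequences.

    Hypothetical helper for illustration; uses population variances.
    The result is undefined (nan) if both inputs are the same constant.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)

    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()  # ddof=0, i.e. population variance
    cov = np.mean((pred - mean_p) * (gold - mean_g))

    # Unlike Pearson correlation, the squared mean difference in the
    # denominator penalises a constant offset between prediction and label.
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0]
    print(concordance_cc(gold, gold))             # perfect agreement
    print(concordance_cc([2.0, 3.0, 4.0], gold))  # same shape, shifted by 1
```

A prediction that tracks the gold signal perfectly but with a constant offset has a Pearson correlation of 1 yet a CCC below 1, which is why CCC is a common choice for continuous valence and arousal prediction.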

Paper Structure (30 sections, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Boxplot of sample durations for each EmoSet corpus. Boxes show the interquartile range (IQR), and the whiskers extend to at most $1.5\times IQR$ beyond the lower and upper quartiles. Black dots mark outliers. For readability, the y-axis is logarithmic from $10^{1}$ upwards.
  • Figure 2: Histogram of sample durations in EmoSet. Most samples are between $1$ and $5$ seconds long.
  • Figure 3: Sample Mel spectrogram images created from speech recordings of IEMOCAP for each of its four base emotion categories. From left to right: angry, happy, neutral, and sad.
  • Figure 4: Architecture of the base ResNet model used in the experiments for multi-corpus SER. Three convolutional stacks extract features from the generated Mel spectrogram input. A 2D attention module then reduces the variable-length output of the convolutional base to a single feature vector for further processing by an MLP classifier head. $nf$ specifies the number of filters ($\#f$) of all convolutions inside a specific residual stack.
  • Figure 5: Depiction of a residual adapter module. The adapter is a small task-specific $1\times1$ convolution applied in parallel to each convolution of the shared base model. The outputs of both convolutions are combined by elementwise summation. The figure also shows the subsequent BN layer, which is not shared between corpora.
  • ...and 1 more figure