Cluster-to-Predict Affect Contours from Speech

Gökhan Kuşçu, Engin Erzin

TL;DR

The paper reframes continuous emotion recognition as predicting dynamic affect-contour clusters from speech using a cluster-to-predict (C2P) self-supervised framework. It introduces two networks, AffectNet and SpeechNet, that jointly learn latent affect representations and map speech to affect-contour clusters, using alternating updates and k-means clustering on latent features. On the RECOLA dataset, the approach achieves F1 scores of 0.84 for arousal and 0.75 for valence in four-class classification, indicating that the learned contours capture meaningful emotional dynamics. The method offers a new representation of emotion in the arousal-valence (AV) space and suggests directions for speech emotion recognition (SER) based on self-supervised learning (SSL) and for cluster-driven emotion modeling.

Abstract

Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our approach defines a cluster-to-predict (C2P) framework that learns affect-contour clusters, which are predicted from speech with higher precision. To achieve this, C2P runs an unsupervised iterative optimization process to learn affect-contour clusters by minimizing both clustering loss and speech-driven affect-contour prediction loss. Our objective findings demonstrate the value of speech-driven clustering for both arousal and valence attributes. Experiments conducted on the RECOLA dataset yielded promising classification results, with F1 scores of 0.84 for arousal and 0.75 for valence in our four-class speech-driven affect-contour prediction model.
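The alternating optimization described in the abstract can be illustrated with a small, heavily simplified sketch. It is not the authors' implementation: the paper's AffectNet and SpeechNet are neural networks and k-means runs on latent features, whereas here the raw contour windows are clustered directly and a softmax regression stands in for the speech model. The function names, the weight `lam`, the window sizes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sliding_windows(track, win, hop):
    """Cut a 1-D annotation track into overlapping windows (affect contours)."""
    starts = range(0, len(track) - win + 1, hop)
    return np.stack([track[s:s + win] for s in starts])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(x, labels, k, steps=200, lr=0.5):
    """Softmax regression by gradient descent (stand-in for a speech network)."""
    w = np.zeros((x.shape[1], k))
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        w -= lr * x.T @ (softmax(x @ w) - onehot) / len(x)
    return w

def cluster_to_predict(contours, speech, k=4, rounds=5, lam=1.0, seed=0):
    """Alternate between fitting the predictor and re-assigning clusters.

    Assumption: each window goes to the cluster that minimizes
    squared distance to the center + lam * prediction loss.
    """
    rng = np.random.default_rng(seed)
    centers = contours[rng.choice(len(contours), size=k, replace=False)].copy()
    dist = ((contours[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = dist.argmin(1)  # initial assignment: nearest center
    for _ in range(rounds):
        # Step 1: fit the speech-driven predictor to the current clusters.
        w = fit_softmax(speech, labels, k)
        nll = -np.log(softmax(speech @ w) + 1e-12)
        # Step 2: re-assign using clustering cost plus prediction cost.
        dist = ((contours[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = (dist + lam * nll).argmin(1)
        # Step 3: update centers of non-empty clusters.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = contours[labels == j].mean(0)
    return labels, centers, w

# Synthetic stand-ins for an annotation track and speech features (not RECOLA).
rng = np.random.default_rng(0)
arousal = 0.05 * np.cumsum(rng.normal(size=500))
contours = sliding_windows(arousal, win=20, hop=5)
feats = contours @ rng.normal(size=(20, 8)) + 0.1 * rng.normal(size=(len(contours), 8))
feats = (feats - feats.mean(0)) / feats.std(0)
speech = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
labels, centers, w = cluster_to_predict(contours, speech, k=4)
```

Setting `lam=0` reduces the assignment step to ordinary k-means on the contours, which makes it easy to compare plain clustering against the speech-driven variant on the same data.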

Paper Structure (11 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Block diagram of the proposed C2P network
  • Figure 2: Arousal and valence contour mean and standard deviations for each C2P cluster
  • Figure 3: Affect cluster pairing percentages for arousal and valence over the RECOLA dataset, with contour cluster trends indicated as flat (no change), increasing, or decreasing