CHUCKLE -- When Humans Teach AI To Learn Emotions The Easy Way

Ankush Pratap Singh, Houwei Cao, Yong Liu

TL;DR

This work tackles the subjectivity and label noise inherent in speech emotion recognition (SER) by introducing CHUCKLE, a perception-driven curriculum learning framework that uses human annotator agreement and alignment between intended and perceived emotions to define sample difficulty. It combines data-driven difficulty scores with rule-based curricula, and demonstrates that rule-based curricula provide the most robust gains. On CREMA-D, CHUCKLE yields a relative accuracy improvement of up to 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while substantially reducing gradient updates, highlighting improved training efficiency and model robustness. Overall, the approach shows that incorporating human perception signals into curriculum design can meaningfully enhance SER performance across architectures and reduce computational cost.

Abstract

Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.
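
To make the idea of perception-based difficulty concrete, the following is a minimal sketch of how annotator agreement and intended-perceived alignment could be turned into difficulty bins and cumulative curriculum stages. It is not the authors' implementation: the function names, the three-bin scheme, the thresholds (0.8 and 0.5), and the cumulative staging are all assumptions made for illustration, and the paper's own rules and equations may differ.

```python
from collections import Counter

def difficulty_bin(votes, intended, thresholds=(0.8, 0.5)):
    """Assign a hypothetical difficulty bin to one clip.

    votes      -- list of emotion labels chosen by crowd annotators
    intended   -- the emotion the speaker was asked to portray
    thresholds -- assumed agreement cut-offs for 'easy' and 'medium'
    """
    top_label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)   # share of annotators on the majority label
    aligned = top_label == intended      # does perception match intention?
    if aligned and agreement >= thresholds[0]:
        return "easy"
    if aligned and agreement >= thresholds[1]:
        return "medium"
    return "hard"

def curriculum_stages(clips):
    """Build cumulative training stages: easy, then easy+medium, then all.

    clips -- list of dicts with keys 'id', 'votes', 'intended' (assumed schema)
    """
    binned = {"easy": [], "medium": [], "hard": []}
    for clip in clips:
        binned[difficulty_bin(clip["votes"], clip["intended"])].append(clip["id"])
    stages, pool = [], []
    for level in ("easy", "medium", "hard"):
        pool = pool + binned[level]
        stages.append(list(pool))
    return stages

# Toy example with made-up annotations (not real CREMA-D data).
clips = [
    {"id": "a", "votes": ["anger"] * 9 + ["fear"], "intended": "anger"},
    {"id": "b", "votes": ["sad"] * 3 + ["neutral"] * 2, "intended": "sad"},
    {"id": "c", "votes": ["happy"] * 2 + ["neutral"] * 3, "intended": "happy"},
]
stages = curriculum_stages(clips)
```

In this sketch a model would be trained on `stages[0]` first and the later, harder stages would be introduced progressively; how many stages to use and when to advance are design choices the sketch leaves open.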

Paper Structure

This paper contains 9 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The proposed framework maps human annotations into perception-based difficulty bins that guide curriculum training, improving both model accuracy and efficiency.
  • Figure 2: Model Performances for Intended-Perceived Agreement 1 (Left to Right): (a) LSTM Training Loss (single trial), (b) LSTM Mean Macro Accuracy across stages (all trials), (c) Transformer Training Loss (single trial), (d) Transformer Mean Macro Accuracy across stages (all trials).
  • Figure 3: Training Cost Comparisons (Left to Right): Mean Macro Accuracy vs Gradient Updates (a) LSTMs, (b) Transformers.