INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition
Andreas Triantafyllopoulos, Anton Batliner, Simon Rampp, Manuel Milling, Björn Schuller
TL;DR
This paper reassesses the INTERSPEECH 2009 FAU-AIBO speech emotion recognition benchmark fifteen years later by evaluating a wide spectrum of deep learning models, from feature-based to end-to-end and self-supervised approaches. Using a two-phase protocol (43 models with fixed hyperparameters, followed by grid-search tuning over 90 configurations) and evaluating on the official test set, the study finds only modest gains over the original baselines, with several models even underperforming. The results reveal non-monotonic progress, limited correlation between model size or publication year and performance, and substantial inter-model variability, suggesting that models learn complementary representations rather than converging on a single optimal solution. Overall, the work underscores persistent challenges of FAU-AIBO and cautions against assuming linear progress in SER without standardized, cross-dataset benchmarking.
Abstract
We revisit the INTERSPEECH 2009 Emotion Challenge, the first-ever speech emotion recognition (SER) challenge, and evaluate a series of deep learning models that are representative of the major advances in SER research since then. We start by training each model with a fixed set of hyperparameters, and then further tune the best-performing models from that initial setup with a grid search. Results are always reported on the official test set, with a separate validation set used only for early stopping. Most models score below or close to the official baseline, although after hyperparameter tuning they marginally outperform the original challenge winners. Our work illustrates that, despite recent progress, FAU-AIBO remains a very challenging benchmark. An interesting corollary is that newer methods do not consistently outperform older ones, showing that progress towards 'solving' SER is not necessarily monotonic.
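The two-phase protocol described above could be organised roughly as in the following sketch. It is an illustration under stated assumptions, not the authors' code: the names `two_phase_benchmark`, `expand_grid`, and `toy_train_fn`, the hyperparameter names and values, and the model names are all hypothetical. The sketch also assumes that the shortlist for the second phase is ranked by the phase-one score, which the abstract does not specify, and that early stopping on the validation set happens inside the training function.

```python
import itertools

# Hypothetical fixed hyperparameters for phase 1 (illustrative values only).
FIXED_CONFIG = {"learning_rate": 1e-3, "batch_size": 32}

# Hypothetical search space for phase 2 (not the grid used in the paper).
GRID = {"learning_rate": [1e-2, 1e-3, 1e-4], "batch_size": [16, 32]}


def expand_grid(grid):
    """Return every hyperparameter combination in the grid as a config dict."""
    keys = sorted(grid)
    combos = itertools.product(*(grid[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


def two_phase_benchmark(models, train_fn, fixed_config, grid, top_k):
    """Run all models once with fixed settings, then grid-search a shortlist.

    train_fn(model_name, config) is assumed to train with early stopping on
    the validation set and to return a single score on the test set.
    """
    # Phase 1: one run per model with the same fixed hyperparameters.
    phase1 = {name: train_fn(name, fixed_config) for name in models}

    # Assumption: shortlist the top_k models by their phase-1 score.
    shortlisted = sorted(phase1, key=phase1.get, reverse=True)[:top_k]

    # Phase 2: every grid configuration for every shortlisted model.
    phase2 = []
    for name in shortlisted:
        for config in expand_grid(grid):
            phase2.append((name, config, train_fn(name, config)))
    return phase1, shortlisted, phase2


def toy_train_fn(model_name, config):
    """Toy stand-in that returns a made-up score; no training happens here."""
    base = {"model_a": 0.40, "model_b": 0.38, "model_c": 0.42}[model_name]
    return base - abs(config["learning_rate"] - 1e-3)


if __name__ == "__main__":
    phase1, shortlisted, phase2 = two_phase_benchmark(
        ["model_a", "model_b", "model_c"], toy_train_fn, FIXED_CONFIG, GRID, top_k=2
    )
    print("shortlisted:", shortlisted)
    print("phase-2 runs:", len(phase2))
```

In practice, `toy_train_fn` would be replaced by a real training routine; the scores it returns here are placeholders and carry no relation to the results reported in the paper.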
