The Super Emotion Dataset
Enric Junqué de Fortuny
TL;DR
The paper addresses the lack of a standardized, large-scale emotion dataset with a psychologically grounded taxonomy by constructing the Super Emotion Dataset. It aggregates multiple public emotion datasets and remaps their labels to Shaver's six core emotions plus a neutral category, which enables consistent cross-domain emotion recognition in NLP. Key contributions are the largest Shaver-compliant resource (over 500k samples), a transparent preprocessing and label-harmonization pipeline, and public availability via HuggingFace; together these mitigate taxonomic inconsistencies and class imbalance. The dataset supports affective NLP research and leaves room for future expansion to further aspects of Shaver's taxonomy and to additional data sources, which would further improve cross-domain applicability and methodological rigor.
Abstract
Despite the widespread use and development of emotion classification datasets in NLP, the field lacks a standardized, large-scale resource that follows a psychologically grounded taxonomy. Existing datasets use inconsistent emotion categories, have limited sample sizes, or focus on specific domains. The Super Emotion Dataset addresses this gap by harmonizing diverse text sources into a unified framework based on Shaver's empirically validated emotion taxonomy, enabling more consistent cross-domain emotion recognition research.
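
As a rough illustration of what harmonizing labels into a single taxonomy can look like, the following minimal Python sketch remaps source labels onto Shaver's six core emotions plus a neutral category. The mapping entries, the `harmonize` helper, and the sample sentences are hypothetical and chosen only for illustration; the dataset's actual mapping rules are not given in this text, and dropping unmapped labels is just one possible policy.

```python
from collections import Counter

# Shaver's six core emotions plus the neutral category.
TARGET_LABELS = ("love", "joy", "surprise", "anger", "sadness", "fear", "neutral")

# Hypothetical source-to-target mapping, for illustration only.
LABEL_MAP = {
    "happiness": "joy",
    "affection": "love",
    "rage": "anger",
    "grief": "sadness",
    "terror": "fear",
    "astonishment": "surprise",
    "no emotion": "neutral",
}


def harmonize(records):
    """Remap (text, source_label) pairs; drop labels with no known target."""
    out = []
    for text, label in records:
        key = label.strip().lower()
        # Labels already in the target taxonomy pass through unchanged.
        target = key if key in TARGET_LABELS else LABEL_MAP.get(key)
        if target is not None:
            out.append((text, target))
    return out


samples = [
    ("What a great day!", "Happiness"),
    ("I miss her so much.", "grief"),
    ("The meeting is at noon.", "no emotion"),
    ("Hmm.", "confusion"),  # unmapped in this sketch, so it is dropped
    ("I adore you.", "love"),
]
harmonized = harmonize(samples)

# Per-class counts make any class imbalance visible after remapping.
counts = Counter(label for _, label in harmonized)
```

Inspecting `counts` after remapping is one simple way to see how the merged sources distribute across the seven target classes.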
