The Super Emotion Dataset

Enric Junqué de Fortuny

TL;DR

The paper addresses the lack of a standardized, large-scale emotion dataset with a psychologically grounded taxonomy by constructing the SuperEmotion dataset. It aggregates multiple public emotion datasets and remaps their labels to Shaver's six core emotions plus a neutral category, enabling cross-domain consistency in NLP emotion recognition. Key contributions include the largest Shaver-compliant resource (over 500k samples), a transparent preprocessing and label-harmonization pipeline, and public availability on Hugging Face; together these mitigate taxonomic inconsistencies and class imbalance. The dataset supports robust affective NLP research and can be extended to additional Shaver aspects and data sources, improving cross-domain applicability and methodological rigor.
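The label remapping described above can be illustrated with a minimal sketch. The six target emotions are the primary categories from Shaver et al.'s taxonomy; the source labels in `HYPOTHETICAL_MAP` and the `harmonize` helper are assumptions made for illustration and are not taken from the paper's actual pipeline.

```python
# Target taxonomy: Shaver's six core emotions plus a neutral category.
SHAVER_LABELS = {"love", "joy", "surprise", "anger", "sadness", "fear", "neutral"}

# Hypothetical source-label mapping, for illustration only.
# The paper's real mapping tables are not reproduced here.
HYPOTHETICAL_MAP = {
    "happiness": "joy",
    "amusement": "joy",
    "affection": "love",
    "rage": "anger",
    "grief": "sadness",
    "terror": "fear",
    "astonishment": "surprise",
    "no_emotion": "neutral",
}


def harmonize(labels):
    """Map source labels onto the target taxonomy.

    Returns a sorted list of unique target labels. Labels that are neither
    target labels nor present in the mapping are dropped (an assumption;
    a real pipeline might log or reject them instead).
    """
    mapped = set()
    for label in labels:
        key = label.strip().lower()
        if key in SHAVER_LABELS:
            mapped.add(key)
        elif key in HYPOTHETICAL_MAP:
            mapped.add(HYPOTHETICAL_MAP[key])
    return sorted(mapped)
```

Returning a set-like result keeps multi-label samples intact while collapsing source labels that map to the same target emotion.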

Abstract

Despite the wide-scale usage and development of emotion classification datasets in NLP, the field lacks a standardized, large-scale resource that follows a psychologically grounded taxonomy. Existing datasets either use inconsistent emotion categories, suffer from limited sample size, or focus on specific domains. The Super Emotion Dataset addresses this gap by harmonizing diverse text sources into a unified framework based on Shaver's empirically validated emotion taxonomy, enabling more consistent cross-domain emotion recognition research.

Paper Structure

This paper contains 10 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Label co-occurrence heatmap showing the percentage of samples annotated with emotion $X$ (X-axis) that are also annotated with emotion $Y$ (Y-axis), denoted as $P(Y\,|\,X)=\frac{\#(X \cap Y)}{\#(X)}$. Diagonal values are always 100%, as each annotation trivially co-occurs with itself.
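The conditional co-occurrence in the caption, $P(Y\,|\,X)=\frac{\#(X \cap Y)}{\#(X)}$, can be computed along the following lines. This is a sketch under the assumption that each sample is represented as a collection of label strings; the function name `cooccurrence` and the toy data in the usage note are illustrative and are not drawn from the dataset.

```python
from collections import Counter
from itertools import product


def cooccurrence(samples):
    """Return {(x, y): P(Y | X) in percent} for every observed label pair.

    `samples` is an iterable of label collections, one per sample.
    Pairs that never co-occur are absent from the result (implicitly 0%).
    """
    label_count = Counter()  # #(X): samples annotated with X
    pair_count = Counter()   # #(X ∩ Y): samples annotated with both X and Y
    for labels in samples:
        unique = set(labels)  # ignore duplicate labels within one sample
        label_count.update(unique)
        # Includes (x, x), so every diagonal entry comes out as 100%.
        pair_count.update(product(unique, repeat=2))
    return {
        (x, y): 100.0 * n / label_count[x]
        for (x, y), n in pair_count.items()
    }
```

For example, with the toy input `[{"joy", "love"}, {"joy"}, {"anger"}, {"joy", "surprise"}]`, the entry for `("love", "joy")` is 100% while the entry for `("joy", "love")` is about 33.3%, which shows that the matrix is not symmetric.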