GoEmotions: A Dataset of Fine-Grained Emotions

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi

TL;DR

GoEmotions introduces a large, manually annotated dataset of 58k Reddit comments labeled with 27 emotion categories plus Neutral, enabling fine-grained, multi-label emotion classification in NLP. The authors implement a rigorous data collection and labeling pipeline, validate annotation reliability with Principal Preserved Component Analysis, and establish a strong BERT-based baseline that achieves 0.46 average F1 on the full taxonomy. They demonstrate the dataset's generalizability through transfer learning to emotion benchmarks across domains and taxonomies, highlighting the practical value of large, high-quality emotion annotations for cross-domain understanding. The work also provides insights into linguistic correlates of emotions and discusses biases and limitations, offering a foundation for improved emotion-aware NLP systems and future multilingual extensions.

Abstract

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.
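
As an illustration of the headline metric, the following is a minimal sketch of how an average F1-score over emotion categories could be computed for multi-label predictions. It assumes that "average" means an unweighted (macro) mean of per-category F1; the function names `per_label_f1` and `macro_f1` and the toy labels are hypothetical and are not taken from the paper or its released code.

```python
from typing import Sequence, Set


def per_label_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]], label: str) -> float:
    """F1 for one label, treating it as a binary decision per example."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if label in pred and label in truth:
            tp += 1
        elif label in pred:
            fp += 1
        elif label in truth:
            fn += 1
    denom = 2 * tp + fp + fn
    # Convention assumed here: a label that is never predicted and never present scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]], labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 (assumed reading of 'average F1')."""
    return sum(per_label_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy data: each example may carry several labels (multi-label setting).
labels = ["joy", "anger", "neutral"]
y_true = [{"joy"}, {"anger"}, {"neutral"}, {"joy", "anger"}]
y_pred = [{"joy"}, {"joy"}, {"neutral"}, {"anger"}]

score = macro_f1(y_true, y_pred, labels)
print(f"macro F1 = {score:.4f}")
```

Because the mean is unweighted, rare categories count as much as frequent ones, so a fine-grained taxonomy with many sparse labels tends to pull the average down.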

Paper Structure

This paper contains 46 sections, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.
  • Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.
  • Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.
  • Figure 4: Softmax weights of each BERT layer when trained on our dataset.
  • Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator.
  • ...and 2 more figures