From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs

Minxue Niu, Mimansa Jaiswal, Emily Mower Provost

TL;DR

Common metrics that use aggregated human annotations as ground truth can underestimate the performance of GPT-4, and a human evaluation experiment reveals a consistent preference for GPT-4 annotations over human annotations across multiple datasets and evaluators.

Abstract

Training emotion recognition models has relied heavily on human-annotated data, which present diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT-4, in automating or assisting emotion annotation. We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance of GPT-4, and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over human annotations across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.

Paper Structure (12 sections, 4 figures, 3 tables)

This paper contains 12 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Disagreements between human and GPT annotations on ISEAR.
  • Figure 2: Human vs. GPT-4 classification.
  • Figure 3: GPT-4 classification vs. generation.
  • Figure 4: Human preference ratio comparing human annotations, GPT-4 classification annotations and GPT-4 generation annotations on emotion classification tasks.