"Only ChatGPT gets me": An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text
Florian Lecourt, Madalina Croitoru, Konstantin Todorov
TL;DR
The study evaluates how well large language models detect emotions expressed in text by comparing GPT-family models and other LLMs against a state-of-the-art baseline on the GoEmotions dataset, using the macro-averaged F1 score ($F1_{macro}$). Prompt engineering markedly improves ChatGPT's emotion-detection performance, but GPT models generally do not surpass specialized classifiers such as BERT-based state-of-the-art models. GPT-4o offers only marginal gains over GPT-3.5-Turbo, while very large models such as Llama-3-70b can approach but not exceed the performance of the GPT models; dictionary-based corrections do not improve results. The work highlights the need for semantically aware metrics and multi-dataset validation to better capture nuanced emotion detection in AI systems intended for empathetic human–computer interaction.
Abstract
This work investigates the capabilities of large language models (LLMs) in detecting and understanding human emotions through text. Drawing upon emotion models from psychology, we adopt an interdisciplinary perspective that integrates insights from the computational and affective sciences. The main goal is to assess how accurately LLMs can identify emotions expressed in textual interactions and to compare different models on this specific task. This research contributes to broader efforts to enhance human-computer interaction by making artificial intelligence technologies more responsive and sensitive to users' emotional nuances. By comparing LLMs with a state-of-the-art model on the GoEmotions dataset, we aim to gauge their effectiveness as systems for emotional analysis, paving the way for potential applications in fields that require a nuanced understanding of human language.
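The comparison is scored with the macro-averaged F1 score ($F1_{macro}$) named in the TL;DR. As a minimal sketch of how that metric behaves in a multi-label setting such as GoEmotions, the function below averages per-label F1 without weighting by label frequency. The function name `macro_f1`, the three-label toy set, and the convention of scoring a label as 0 when it never appears in either gold or predicted labels are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over `labels` (multi-label setting)."""
    per_label = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label absent from both gold and predictions scores 0.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Toy example with hypothetical labels (GoEmotions itself uses a larger label set).
labels = ["joy", "anger", "sadness"]
gold = [{"joy"}, {"anger"}, {"sadness", "anger"}, {"joy"}]
pred = [{"joy"}, {"joy"}, {"sadness"}, {"joy"}]

score = macro_f1(gold, pred, labels)
print(f"macro F1 = {score:.2f}")  # joy 0.8, anger 0.0, sadness 1.0 -> 0.60
```

Because every label counts equally, a model that misses a rare emotion entirely (here, `anger`) is penalized as heavily as one that misses a frequent emotion, which is why the metric suits an imbalanced label set.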
