Fearful Falcons and Angry Llamas: Emotion Category Annotations of Arguments by Humans and LLMs
Lynn Greschner, Roman Klinger
TL;DR
This study introduces Emo-DeFaBel, the first corpus of German argumentative texts annotated with discrete emotion categories, to address the gap between NLP emotion analysis and psychological theories of emotion in persuasion. The authors crowdsource human labels for 300 arguments derived from DeFaBel and evaluate three instruction-tuned LLMs (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini) with binary, closed-domain, and open-domain emotion prompts, examining how well discrete emotion categories predict emotionality and how reliably the models assign specific emotions. Including discrete emotion labels improves emotionality prediction, but all models are biased toward negative emotions and reach only limited precision on individual emotion classes, especially under strict evaluation. The work provides a foundation for improving prompt design and model fine-tuning for argument-emotion analysis, and discusses ethical and methodological considerations for crowdsourced annotation and LLM-based labeling. Overall, Emo-DeFaBel advances understanding of how discrete emotions shape argument persuasiveness and highlights both opportunities and challenges in using LLMs for fine-grained emotion annotation of argumentative text.
Abstract
Arguments evoke emotions, influencing the effect of the argument itself. Not only the emotional intensity but also the emotion category influences the argument's effects, for instance, the willingness to adapt stances. While binary emotionality has been studied in arguments, there is no work on discrete emotion categories (e.g., "Anger") in such data. To fill this gap, we crowdsource subjective annotations of emotion categories in a German argument corpus and evaluate automatic LLM-based labeling methods. Specifically, we compare three prompting strategies (zero-shot, one-shot, chain-of-thought) on three large instruction-tuned language models (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini). We further vary the definition of the output space to be binary (is there emotionality in the argument?), closed-domain (which emotion from a given label set is in the argument?), or open-domain (which emotion is in the argument?). We find that emotion categories enhance the prediction of emotionality in arguments, emphasizing the need for discrete emotion annotations in arguments. Across all prompt settings and models, automatic predictions show high recall but low precision for anger and fear, indicating a strong bias toward negative emotions.
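To make the "high recall but low precision" pattern concrete, the following minimal sketch computes per-class precision and recall for a single emotion label. It is an illustration only: the function name `per_class_precision_recall`, the single-label setup, and the toy gold and predicted labels are assumptions made here, not the authors' evaluation code or data.

```python
def per_class_precision_recall(gold, pred, label):
    """Return (precision, recall) for one emotion label.

    Assumes exactly one label per argument (a simplification
    made for this sketch, not taken from the paper).
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy data: a hypothetical model that over-predicts "anger".
gold = ["joy", "anger", "no emotion", "fear", "joy", "sadness"]
pred = ["anger", "anger", "anger", "fear", "fear", "anger"]

for emotion in ("anger", "fear"):
    p, r = per_class_precision_recall(gold, pred, emotion)
    print(f"{emotion}: precision={p:.2f} recall={r:.2f}")
```

In this toy case every gold "anger" instance is found, so recall is perfect, but most "anger" predictions are wrong, so precision is low. A model that defaults to negative emotions would produce this kind of profile.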
