Are Large Language Models Reliable Argument Quality Annotators?
Nailia Mirzakhmedova, Marcel Gohsen, Chia Hao Chang, Benno Stein
TL;DR
The paper investigates whether state-of-the-art LLMs can reliably annotate argument quality by comparing GPT-3 and PaLM 2 outputs to expert and novice human judgments on the Dagstuhl-15512-ArgQuality corpus, using Wachsmuth's 15-dimension taxonomy. It deploys expert and simplified novice prompts, includes reasoning prompts, and measures inter-annotator agreement with Krippendorff's $\alpha$. The findings are that LLMs produce more consistent annotations than humans and that PaLM 2 often aligns better with human judgments than GPT-3. Moreover, integrating PaLM 2 annotations with human annotations significantly improves overall agreement, which points to a practical, semi-automatic workflow for scaling argument quality assessment. The results suggest two deployment modes for LLMs as annotators: fully automatic annotation, and augmentation, in which the model serves as an additional annotator alongside humans. Both can reduce manual effort and accelerate large-scale annotation tasks, though limitations regarding prompt sensitivity and generalizability remain. The work points to few-shot prompting, fine-tuning, and evaluation of a broader set of open-source LLMs as future ways to improve agreement with human judgments.
Abstract
Evaluating the quality of arguments is a crucial aspect of any system leveraging argument mining. However, it is a challenge to obtain reliable and consistent annotations regarding argument quality, as this usually requires domain-specific expertise of the annotators. Even among experts, the assessment of argument quality is often inconsistent due to the inherent subjectivity of this task. In this paper, we study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators. To assess the capability of LLMs in this regard, we analyze the agreement between model, human expert, and human novice annotators based on an established taxonomy of argument quality dimensions. Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts across most of the quality dimensions. Moreover, we show that using LLMs as additional annotators can significantly improve the agreement between annotators. These results suggest that LLMs can serve as a valuable tool for automated argument quality assessment, thus streamlining and accelerating the evaluation of large argument datasets.
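The agreement analysis described above relies on Krippendorff's $\alpha$, which compares the disagreement observed within each annotated item to the disagreement expected by chance. The following is a minimal sketch of that coefficient, not the authors' implementation. The function names, the choice of an interval (squared-difference) distance, the 1-to-3 rating scale, and the toy ratings are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence


def interval_distance(a: float, b: float) -> float:
    """Squared difference; an assumed metric for numeric quality scores."""
    return float((a - b) ** 2)


def krippendorff_alpha(
    ratings: Sequence[Sequence[Optional[float]]],
    distance: Callable[[float, float], float] = interval_distance,
) -> float:
    """Krippendorff's alpha for ratings[annotator][item]; None marks a missing rating."""
    n_items = len(ratings[0])

    # Keep only items rated by at least two annotators (the "pairable" values).
    units = []
    for item in range(n_items):
        values = [row[item] for row in ratings if row[item] is not None]
        if len(values) >= 2:
            units.append(values)

    pooled = [v for unit in units for v in unit]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered pairs within each item, weighted by 1 / (m_u - 1).
    observed = sum(
        sum(distance(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in units
    ) / n

    # Expected disagreement: ordered pairs across all pooled values.
    expected = sum(distance(a, b) for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical scores from two human annotators on four arguments (assumed 1-3 scale).
    humans = [[1, 2, 3, 3], [1, 2, 3, 2]]
    # Hypothetical model scores, added as a third annotator.
    model = [1, 2, 3, 3]

    print("humans only:", krippendorff_alpha(humans))
    print("humans + model:", krippendorff_alpha(humans + [model]))
```

Calling the function once on the human ratings alone and once with the model's ratings appended mirrors the "LLM as additional annotator" setting. The toy numbers here say nothing about the paper's actual results.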
