Are Large Language Models Reliable Argument Quality Annotators?

Nailia Mirzakhmedova; Marcel Gohsen; Chia Hao Chang; Benno Stein

TL;DR

The paper investigates whether state-of-the-art LLMs can reliably annotate argument quality by comparing GPT-3 and PaLM 2 outputs to expert and novice human judgments on the Dagstuhl-15512-ArgQuality corpus, using Wachsmuth's 15-dimension taxonomy. It uses expert prompts and simplified novice prompts, along with reasoning variants, and measures inter-annotator agreement with Krippendorff's $\alpha$. The findings are that LLMs produce more consistent annotations than humans and that PaLM 2 often aligns better with human judgments. Moreover, integrating PaLM 2 annotations with human annotations significantly improves overall agreement, which points to a practical, semi-automatic workflow for scaling argument quality assessment. The results suggest two deployment modes for LLMs as annotators: fully automatic annotation, and augmentation as additional annotators alongside humans. Both can reduce manual effort and accelerate large-scale annotation tasks, though limitations regarding prompt sensitivity and generalizability remain. The work points to future enhancements via few-shot prompting, fine-tuning, and broader evaluation of open-source LLMs to further improve agreement with human judgments.

Abstract

Evaluating the quality of arguments is a crucial aspect of any system leveraging argument mining. However, it is a challenge to obtain reliable and consistent annotations regarding argument quality, as this usually requires domain-specific expertise of the annotators. Even among experts, the assessment of argument quality is often inconsistent due to the inherent subjectivity of this task. In this paper, we study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators. To assess the capability of LLMs in this regard, we analyze the agreement between model, human expert, and human novice annotators based on an established taxonomy of argument quality dimensions. Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts across most of the quality dimensions. Moreover, we show that using LLMs as additional annotators can significantly improve the agreement between annotators. These results suggest that LLMs can serve as a valuable tool for automated argument quality assessment, thus streamlining and accelerating the evaluation of large argument datasets.
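The agreement analysis described above relies on Krippendorff's $\alpha$, defined as $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement within items and $D_e$ is the disagreement expected by chance. The following is a minimal sketch of that computation, not the authors' implementation. The function name `krippendorff_alpha`, the squared-difference (interval) distance, and the toy ratings are assumptions made for illustration; the paper's exact distance metric is not stated in this section.

```python
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for a list of items.

    `units` holds one list of ratings per item (one entry per annotator);
    None marks a missing rating. The default distance is the squared
    difference (interval metric), which is an assumption of this sketch.
    """
    # Drop missing ratings, then keep only items rated at least twice.
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [u for u in pairable if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered pairs of ratings within each item.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs across all pairable ratings.
    values = [v for u in pairable for v in u]
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - d_o / d_e


# Toy ratings invented for illustration (1 = low, 2 = medium, 3 = high).
human_ratings = [[1, 2], [3, 3], [2, 3], [1, 1]]
# Hypothetical LLM ratings for the same four arguments.
llm_ratings = [1, 3, 2, 1]

# "LLM as additional annotator": append the model's rating to each item.
combined = [h + [m] for h, m in zip(human_ratings, llm_ratings)]

print("humans only:", krippendorff_alpha(human_ratings))
print("humans + LLM:", krippendorff_alpha(combined))
```

Comparing the two printed values mirrors the idea of treating the model as one more annotator: the same agreement statistic is computed before and after adding the model's ratings to each item.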

Paper Structure (19 sections, 5 figures, 5 tables)

This paper contains 19 sections, 5 figures, 5 tables.

Introduction
Related Work
Evaluating Argument Quality
LLMs as Annotators
Experimental Design
Expert Annotation
Novice Annotation
Models
Prompting
Results
Consistency of Argument Quality Annotations
RQ1: Do LLMs provide more consistent evaluations of argument quality compared to human annotators?
Agreement between Humans and LLMs
RQ2: Do the assessments of argument quality made by LLMs align with those made by either human experts or human novices?
LLMs as Additional Annotators
...and 4 more sections

Figures (5)

Figure 1: An expert prompt that contains instructions and an example issue, stance, and argument from the Dagstuhl-15512 ArgQuality corpus. This particular prompt example asks the model to rate the clarity of the argument. The reasoning variant of this prompt is colored in gray.
Figure 2: Distribution of the assigned quality ratings across all quality dimensions compared between human annotators and LLMs.
Figure 3: Inter-annotator agreement (Krippendorff's $\alpha$) between human and LLM annotations for each fine-grained argument quality dimension.
Figure 4: Inter-annotator agreement (Krippendorff's $\alpha$) between human and LLM annotations for each coarse-grained argument quality dimension.
Figure 5: Overall inter-annotator agreement (Krippendorff's $\alpha$) between each combination of human expert, novice, and LLM-generated annotations.

Theorems & Definitions (2)

Definition: Local Acceptability (Expert)
Definition: Local Acceptability (Novice)
