LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments

Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann

TL;DR

This study systematically evaluates four state-of-the-art LLMs on three argument mining tasks (binary detection, span extraction, and relationship classification) using the COMARG and YRU datasets of online comments across six controversial topics. Fine-tuned LLMs, especially Llama3 with LoRA, achieve the strongest performance on detection and extraction, but these gains come at a substantial environmental cost. A detailed error analysis reveals a tendency to over-predict argumentative content in emotionally charged comments and to struggle with long, nuanced arguments, which limits practical applications such as content moderation. Overall, the results demonstrate the promise of LLMs for mining pre-defined arguments while identifying key limitations and directions for future improvements in prompting, fine-tuning, and dataset design.

Abstract

Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.

Paper Structure

This paper contains 36 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: An online comment (top) which makes use of two pre-defined arguments (red and green boxes). The comment attacks A1 (left) and supports A2 (right).
  • Figure 2: A comment (top, left) and pre-defined argument (bottom, left). We predict whether the comment makes use of the argument (Task 1), where it mentions the argument (Task 2) and whether it supports or attacks the argument (Task 3).
  • Figure 3: Proportion of false positive and false negative errors for Pro and Con arguments in each dataset.
  • Figure 4: The effect of comment length on comment identification accuracy (Task 1; violin/box plots) and argument extraction (Task 2; ROUGE-L).