LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments
Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
TL;DR
This study systematically evaluates four state-of-the-art LLMs on three argument mining tasks (binary detection, span extraction, and relationship classification) using the COMARG and YRU datasets of online comments across six controversial topics. Fine-tuned LLMs, especially Llama3 with LoRA, achieve the strongest performance on detection and extraction, but these gains come at a substantial environmental cost. A detailed error analysis reveals a tendency to over-predict argumentative content in emotionally charged comments and to struggle with long, nuanced arguments, which points to practical limits for applications such as content moderation. Overall, the results demonstrate the promise of LLMs for mining pre-defined arguments while identifying key limitations and directions for future improvements in prompting, fine-tuning, and dataset design.
Abstract
Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests strong overall performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis reveals systematic shortcomings on long, nuanced comments and on emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and the current limitations of LLMs for automated argument analysis in online comments.
