Large Language Models in Argument Mining: A Survey
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
TL;DR
This survey synthesises research on Argument Mining (AM) in the era of Large Language Models (LLMs), reframing AM as an interdependent system in which prompts, retrieval, and reasoning reshape tasks, data, and evaluation. It documents the shift from task-specific datasets to integrated corpora, outlines LLM-assisted dataset creation and annotation, and analyses architectural patterns such as instruction tuning and retrieval-augmented generation. Key contributions include a consolidated dataset landscape, reconceptualised task formulations with modern evaluation paradigms, and a forward-looking research agenda addressing long-context reasoning, multimodal/multilingual robustness, and bias auditing. The work highlights both opportunities and risks (cost, provenance, bias, and evaluation circularity) and argues for interpretable, human-centred, and theory-grounded AM that scales to real-world argumentative domains. Overall, the paper charts a roadmap for robust, transparent, and scalable LLM-driven computational argumentation.
Abstract
Large Language Models (LLMs) have fundamentally reshaped Argument Mining (AM), shifting it from a pipeline of supervised, task-specific classifiers to a spectrum of prompt-driven, retrieval-augmented, and reasoning-oriented paradigms. Yet existing surveys largely predate this transition, leaving unclear how LLMs alter task formulations, dataset design, evaluation methodology, and the theoretical foundations of computational argumentation. In this survey, we synthesise research and provide the first unified account of AM in the LLM era. We revisit canonical AM subtasks, i.e., claim and evidence detection, relation prediction, stance classification, argument quality assessment, and argumentative summarisation, and show how prompting, chain-of-thought reasoning, and in-context learning blur traditional task boundaries. We catalogue the rapid evolution of resources, including integrated multi-layer corpora and LLM-assisted annotation pipelines that introduce new opportunities as well as risks of bias and evaluation circularity. Building on this mapping, we identify emerging architectural patterns across LLM-based AM systems and consolidate evaluation practices spanning component-level accuracy, soft-label quality assessment, and LLM-judge reliability. Finally, we outline persistent challenges, including long-context reasoning, multimodal and multilingual robustness, interpretability, and cost-efficient deployment, and propose a forward-looking research agenda for LLM-driven computational argumentation.
