Large Language Models in Argument Mining: A Survey

Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic

TL;DR

This survey synthesises Argument Mining (AM) in the era of Large Language Models (LLMs), reframing AM as an interdependent system in which prompts, retrieval, and reasoning reshape tasks, data, and evaluation. It documents the shift from task-specific datasets to integrated corpora, outlines LLM-assisted dataset creation, and analyses architectural patterns such as instruction tuning and retrieval-augmented generation. Key contributions include a consolidated dataset landscape, reconceptualised task formulations with modern evaluation paradigms, and a forward-looking research agenda addressing long-context reasoning, multimodal/multilingual robustness, and bias auditing. The work highlights both opportunities and risks (cost, provenance, bias, and evaluation circularity), arguing for interpretable, human-centred, and theory-grounded AM that scales to real-world argumentative domains. Overall, the paper charts a roadmap for robust, transparent, and scalable LLM-driven computational argumentation.

Abstract

Large Language Models (LLMs) have fundamentally reshaped Argument Mining (AM), shifting it from a pipeline of supervised, task-specific classifiers to a spectrum of prompt-driven, retrieval-augmented, and reasoning-oriented paradigms. Yet existing surveys largely predate this transition, leaving unclear how LLMs alter task formulations, dataset design, evaluation methodology, and the theoretical foundations of computational argumentation. In this survey, we synthesise research and provide the first unified account of AM in the LLM era. We revisit canonical AM subtasks, i.e., claim and evidence detection, relation prediction, stance classification, argument quality assessment, and argumentative summarisation, and show how prompting, chain-of-thought reasoning, and in-context learning blur traditional task boundaries. We catalogue the rapid evolution of resources, including integrated multi-layer corpora and LLM-assisted annotation pipelines that introduce new opportunities as well as risks of bias and evaluation circularity. Building on this mapping, we identify emerging architectural patterns across LLM-based AM systems and consolidate evaluation practices spanning component-level accuracy, soft-label quality assessment, and LLM-judge reliability. Finally, we outline persistent challenges, including long-context reasoning, multimodal and multilingual robustness, interpretability, and cost-efficient deployment, and propose a forward-looking research agenda for LLM-driven computational argumentation.

Paper Structure

This paper contains 98 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Key large language models and techniques influencing argument mining.
  • Figure 2: Paper-collection workflow and resulting dataset.
  • Figure 3: End-to-end workflow for constructing debate-ready argument summaries. This pipeline illustrates the hierarchical progression from Step 1: Claim Detection, where sentence-level argumentative units are identified, through Step 2: Stance Detection and Step 3: Key Point Analysis, which organize claims by polarity and cluster them into coherent key points. Step 4: Evidence Detection retrieves supporting or opposing evidence for each key point, enabling Step 5: Argument Summarization to generate concise, structured argumentative summaries. Finally, Step 6: Argumentative Assessment evaluates the completeness, coherence, and quality of the produced arguments. The Sankey-style flow visualizes how claims, stances, evidence, and key points propagate through the pipeline to form coherent debate scripts.
  • Figure 4: Overview of quality dimensions for argument mining quality assessment discovered in the surveyed literature. Note: The figure extends a taxonomy proposed by Ivanova et al. (EMNLP 2024; DBLP:conf/emnlp/IvanovaHN24).
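
The six-step workflow described in the Figure 3 caption can be sketched as a chain of small functions. This is only an illustrative toy, not the survey's method: the cue-word lists, the `KeyPoint` record, and every function name below are hypothetical placeholders standing in for the trained or prompted models a real system would use at each step.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical cue lists: placeholders for trained or prompted models.
CLAIM_CUES = {"should", "must", "ought"}
CON_CUES = {"not", "never", "no"}
EVIDENCE_CUES = {"study", "data", "survey"}


@dataclass
class KeyPoint:
    stance: str
    claims: list
    evidence: list = field(default_factory=list)


def tokens(text):
    """Return the set of lowercased alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", text.lower()))


def detect_claims(sentences):
    """Step 1: keep sentences that contain a claim cue word."""
    return [s for s in sentences if tokens(s) & CLAIM_CUES]


def detect_stance(claim):
    """Step 2: label a claim 'con' if it contains a negation cue, else 'pro'."""
    return "con" if tokens(claim) & CON_CUES else "pro"


def group_key_points(claims):
    """Step 3: group claims by stance, a crude stand-in for key point clustering."""
    groups = defaultdict(list)
    for claim in claims:
        groups[detect_stance(claim)].append(claim)
    return [KeyPoint(stance, members) for stance, members in groups.items()]


def attach_evidence(key_points, sentences):
    """Step 4: attach cue-bearing sentences that share a long word with a claim."""
    for kp in key_points:
        claim_words = {w for c in kp.claims for w in tokens(c) if len(w) > 4}
        for s in sentences:
            words = tokens(s)
            if s not in kp.claims and words & EVIDENCE_CUES and words & claim_words:
                kp.evidence.append(s)
    return key_points


def summarise(kp):
    """Step 5: one-line summary from the first claim and the evidence count."""
    return f"[{kp.stance}] {kp.claims[0]} (evidence sentences: {len(kp.evidence)})"


def assess(kp):
    """Step 6: toy completeness score, 1.0 if any evidence was found, else 0.5."""
    return 1.0 if kp.evidence else 0.5


def run_pipeline(sentences):
    """Run Steps 1 to 6 in order and return one record per key point."""
    key_points = attach_evidence(group_key_points(detect_claims(sentences)), sentences)
    return [
        {"stance": kp.stance, "summary": summarise(kp), "score": assess(kp)}
        for kp in key_points
    ]


debate = [
    "Schools should adopt uniforms.",
    "A study found uniforms reduce bullying.",
    "Schools should not adopt uniforms.",
    "The weather was nice.",
]
for record in run_pipeline(debate):
    print(record)
```

Each function maps to one step of the caption, so any stage could be swapped for a stronger component (for example, a prompted LLM for stance detection) without changing the surrounding flow. Grouping by stance alone is a deliberate simplification of Key Point Analysis, which in practice would cluster claims within each polarity.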