Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation

Hao Li, Yuping Wu, Viktor Schlegel, Riza Batista-Navarro, Tharindu Madusanka, Iqra Zahid, Jiayan Zeng, Xiaochi Wang, Xinran He, Yizhi Li, Goran Nenadic

TL;DR

This work introduces ASE, an end-to-end argument mining dataset that unifies four tasks—evidence detection (ED), evidence convincingness ranking (ECR), argument summarisation (AS), and summary quality evaluation (SQE)—with an additional ASR variant for ranking generated summaries. By collecting evidence across 31 topics, generating model-based summaries with diverse inputs, and obtaining multi-faceted human judgments, ASE enables comprehensive evaluation of end-to-end debate preparation pipelines. Baseline experiments across classification, contrastive learning, and summarisation show strong performance on isolated tasks but substantial performance degradation when tasks are composed end-to-end, underscoring the need for integrated models and higher-quality evaluation data. The authors also demonstrate correlations between automated metrics and human judgments, and they open-source the dataset and benchmarking code to spur further research into robust, human-aligned end-to-end argument mining and summarisation systems.

Abstract

With the recent advances of large language models (LLMs), it is now feasible to build an automated debate system that helps people synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate. It covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR), and metric learning for automated evaluation of the resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find that, while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, in both automated measures and human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at https://github.com/HaoBytes/ArgSum-Datatset

Paper Structure (22 sections, 4 equations, 2 figures, 15 tables)


Figures (2)

  • Figure 1: Overview of the proposed annotation pipeline, which comprises four main tasks. Task 1 identifies whether a snippet is evidence for a given claim; Task 2 selects the evidence that makes each claim most persuasive; Task 3 generates a diverse set of debate scripts for a given debate topic and stance, then ranks them according to human preference; Task 4 measures the quality dimensions of those scripts.
  • Figure 2: Output of narrative generation of Project Debater
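
To make the four-task flow in Figure 1 concrete, the following is a minimal Python sketch of how the stages could be chained. It is not the authors' implementation: the function `run_pipeline`, its parameters, and the toy scoring functions are all hypothetical stand-ins for the trained models and human judgments used in the paper, and the example claim and snippets are invented purely for illustration.

```python
from typing import Callable, List, Tuple


def run_pipeline(
    claim: str,
    snippets: List[str],
    is_evidence: Callable[[str, str], bool],
    convincingness: Callable[[str, str], float],
    summarisers: List[Callable[[str, List[str]], str]],
    quality: Callable[[str], float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Chain the four tasks; every callable is a hypothetical stand-in."""
    # Task 1 (ED): keep only snippets judged to be evidence for the claim.
    evidence = [s for s in snippets if is_evidence(claim, s)]
    # Task 2 (ECR): order evidence by convincingness, most convincing first.
    ranked = sorted(
        evidence, key=lambda s: convincingness(claim, s), reverse=True
    )[:top_k]
    # Task 3 (ASR): generate several candidate summaries from the evidence.
    candidates = [summarise(claim, ranked) for summarise in summarisers]
    # Task 4 (SQE): score each candidate, then rank candidates by that score.
    scored = [(c, quality(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins (assumptions for illustration only, not real models).
def mentions_uniforms(claim: str, snippet: str) -> bool:
    return "uniforms" in snippet.lower()


def longer_is_better(claim: str, snippet: str) -> float:
    return float(len(snippet))


def first_only(claim: str, evidence: List[str]) -> str:
    return evidence[0] if evidence else ""


def join_all(claim: str, evidence: List[str]) -> str:
    return " ".join(evidence)


def word_count(text: str) -> float:
    return float(len(text.split()))


claim = "School uniforms should be mandatory"
snippets = [
    "A survey found uniforms reduce morning stress for families.",
    "The weather was sunny on Tuesday.",
    "Uniforms lower clothing costs, according to one district report.",
]
result = run_pipeline(
    claim, snippets, mentions_uniforms, longer_is_better,
    [first_only, join_all], word_count,
)
for summary, score in result:
    print(score, summary)
```

Because each stage consumes the output of the previous one, an error made in Task 1 or Task 2 propagates into the summaries and their scores, which is one way to read the end-to-end degradation reported in the abstract.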