Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Victor Hugo Nascimento Rocha; Igor Cataneo Silveira; Paulo Pirozelli; Denis Deratani Mauá; Fabio Gagliardi Cozman

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Victor Hugo Nascimento Rocha, Igor Cataneo Silveira, Paulo Pirozelli, Denis Deratani Mauá, Fabio Gagliardi Cozman

TL;DR

This paper tackles the risk of misinformation from large language models by creating ArGPT, a dataset of good, bad, and ugly arguments generated via a teacher–student dialogue with ChatGPT. It defines five AM/AES-related tasks and provides baselines, showing that ArGPT can differentiate argument quality and generalize to human-annotated data. The results indicate that LLM-generated arguments resemble human argumentation sufficiently to train and evaluate AM and AES systems at lower cost, with end-to-end pipelines capable of producing argumentative graphs albeit with error propagation. Overall, ArGPT offers a scalable resource for developing detectors of problematic LLM-generated arguments and for training AM/AES models using synthetic yet human-aligned data.

Abstract

The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify ``fake arguments'' generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

TL;DR

Abstract

Paper Structure (11 sections, 1 figure, 4 tables)

This paper contains 11 sections, 1 figure, 4 tables.

Introduction
Background
Argument(ation) Mining
Automatic Essay Scoring
Generating Argumentative Essays with ChatGPT
ArGPT: Dataset Annotation and Statistics
Using ArGPT: Supported Tasks and Their Baselines
Evaluation Metrics
Results and Discussion
The Connection with Human Argumentation
Conclusions and Future Work

Figures (1)

Figure 1: Generating ArGPT. We selected several themes for argumentative essays based on false or self-contradictory ideas. We gave ChatGPT a first prompt (the student prompt), instructing it to create an argumentative essay about a selected theme. In the following round, we provided a second prompt (the professor prompt) instructing it to write an essay correcting the student's argumentation. If the produced essays did not follow our requirements, we repeated the process.

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

TL;DR

Abstract

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Authors

TL;DR

Abstract

Table of Contents

Figures (1)