Table of Contents
Fetching ...

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Victor Hugo Nascimento Rocha, Igor Cataneo Silveira, Paulo Pirozelli, Denis Deratani Mauá, Fabio Gagliardi Cozman

TL;DR

This paper tackles the risk of misinformation from large language models by creating ArGPT, a dataset of good, bad, and ugly arguments generated via a teacher–student dialogue with ChatGPT. It defines five AM/AES-related tasks and provides baselines, showing that ArGPT can differentiate argument quality and generalize to human-annotated data. The results indicate that LLM-generated arguments resemble human argumentation sufficiently to train and evaluate AM and AES systems at lower cost, with end-to-end pipelines capable of producing argumentative graphs albeit with error propagation. Overall, ArGPT offers a scalable resource for developing detectors of problematic LLM-generated arguments and for training AM/AES models using synthetic yet human-aligned data.

Abstract

The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify ``fake arguments'' generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.

Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

TL;DR

This paper tackles the risk of misinformation from large language models by creating ArGPT, a dataset of good, bad, and ugly arguments generated via a teacher–student dialogue with ChatGPT. It defines five AM/AES-related tasks and provides baselines, showing that ArGPT can differentiate argument quality and generalize to human-annotated data. The results indicate that LLM-generated arguments resemble human argumentation sufficiently to train and evaluate AM and AES systems at lower cost, with end-to-end pipelines capable of producing argumentative graphs albeit with error propagation. Overall, ArGPT offers a scalable resource for developing detectors of problematic LLM-generated arguments and for training AM/AES models using synthetic yet human-aligned data.

Abstract

The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify ``fake arguments'' generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
Paper Structure (11 sections, 1 figure, 4 tables)

This paper contains 11 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Generating ArGPT. We selected several themes for argumentative essays based on false or self-contradictory ideas. We gave ChatGPT a first prompt (the student prompt), instructing it to create an argumentative essay about a selected theme. In the following round, we provided a second prompt (the professor prompt) instructing it to write an essay correcting the student's argumentation. If the produced essays did not follow our requirements, we repeated the process.