MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification

Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel, Fabian Suchanek

TL;DR

MAFALDA addresses fragmentation in fallacy detection by unifying four public datasets into a single benchmark with a cohesive taxonomy. It introduces a disjunctive annotation scheme to capture annotation subjectivity and a span-based evaluation framework, defined through $C(p, l_p, g, l_g, |p|)$ and $\mathrm{Recall}(P, G)$, that accommodates optional spans labeled as 'no-fallacy'. The dataset comprises 9,745 texts, including 200 manually annotated texts with 268 fallacious spans and accompanying explanations, released under CC-BY-SA. Zero-shot evaluation of GPT-3.5 and multiple open LLMs reveals strong performance on Level 0 but substantial gaps at Levels 1-2, suggesting that few-shot or top-down strategies and further methodological work are needed. The benchmark provides a practical, extensible resource for advancing subjective NLP tasks in fallacy detection and broader argumentation research.

Abstract

We introduce MAFALDA, a benchmark for fallacy classification that merges and unifies previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models in a zero-shot setting, as well as human performance, on MAFALDA to assess their capability to detect and classify fallacies.

Paper Structure

This paper contains 66 sections, 4 theorems, 32 equations, 12 figures, and 22 tables.

Key Result

Proposition I.1

Given a gold standard $G$, where each span comprises only a single sentence, and where each fallacy set contains only one element, and given a prediction $P$, where each span comprises only a single sentence, our precision coincides with the standard precision.
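
The proposition can be checked on a toy case. The sketch below is a minimal illustration and not the paper's implementation: it assumes that $C(p, l_p, g, l_g, h)$ is the overlap between spans $p$ and $g$ divided by $h$, counted only when the predicted label is among the gold span's alternative labels, and that precision averages the best such score over all predicted spans. Spans are measured in sentence units, and the names `score`, `span_precision`, and `standard_precision` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# A span is (start, end, labels): a half-open range [start, end) of sentence
# indices plus a set of labels (a singleton here, as the proposition requires).
Span = Tuple[int, int, FrozenSet[str]]


def overlap(a: Span, b: Span) -> int:
    """Number of sentences shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def score(p: Span, g: Span, h: int) -> float:
    """Assumed form of C(p, l_p, g, l_g, h): overlap / h if the labels agree."""
    if h == 0 or not (p[2] & g[2]):
        return 0.0
    return overlap(p, g) / h


def span_precision(pred: List[Span], gold: List[Span]) -> float:
    """Average, over predicted spans, of the best score against any gold span."""
    if not pred:
        raise ValueError("this sketch only covers non-empty predictions")
    total = 0.0
    for p in pred:
        length = p[1] - p[0]  # plays the role of |p|
        total += max((score(p, g, length) for g in gold), default=0.0)
    return total / len(pred)


def standard_precision(pred: List[Span], gold: List[Span]) -> float:
    """Fraction of predicted (sentence, label) pairs that appear in the gold set."""
    gold_pairs = {(g[0], label) for g in gold for label in g[2]}
    hits = sum(1 for p in pred if (p[0], next(iter(p[2]))) in gold_pairs)
    return hits / len(pred)


# Toy data: every span covers exactly one sentence and carries one label.
gold = [
    (0, 1, frozenset({"ad hominem"})),
    (2, 3, frozenset({"slippery slope"})),
]
pred = [
    (0, 1, frozenset({"ad hominem"})),      # correct
    (1, 2, frozenset({"strawman"})),        # no gold span on this sentence
    (2, 3, frozenset({"slippery slope"})),  # correct
    (3, 4, frozenset({"ad hominem"})),      # no gold span on this sentence
]

print(span_precision(pred, gold), standard_precision(pred, gold))
```

Under these assumptions, every single-sentence prediction scores either 1 (same sentence, same label) or 0, so the span-based average reduces to the usual ratio of correct predictions to all predictions.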

Figures (12)

  • Figure 1: Examples of Fallacies. The spans of the fallacies are underlined. Example \ref{ex:implicit} is from Jin et al. (2022), \ref{ex:political} from Goffredo et al. (2022), and \ref{ex:finance_law_degree} from Sahai et al. (2021). Detailed annotations are in Appendix \ref{app:additional_examples}.
  • Figure 2: Tree structure of our taxonomy. Detailed definitions of the fallacies are in Appendix \ref{app:informal_def}.
  • Figure 3: List of fallacies per paper and in our taxonomy.
  • Figure 4: Statistics about our source datasets. The left graphic shows the vocabulary size, while the right graphic shows the average length of the texts.
  • Figure 5: Co-occurrence of labels (frequency)
  • ...and 7 more figures

Theorems & Definitions (4)

  • Proposition I.1
  • Proposition I.2
  • Proposition I.3
  • Proposition I.4