MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification

Chadi Helwe; Tom Calamai; Pierre-Henri Paris; Chloé Clavel; Fabian Suchanek

TL;DR

MAFALDA addresses fragmentation in fallacy detection by unifying four public datasets into a single benchmark with a cohesive taxonomy. It introduces a disjunctive annotation scheme to capture annotation subjectivity and a span-based evaluation framework using $C(p, l_p, g, l_g, |p|)$ and $Recall(P, G)$, including optional spans labeled as 'no-fallacy'. The dataset comprises 9,745 texts, including 200 manually annotated texts with 268 fallacious spans and accompanying explanations, released under CC-BY-SA. Zero-shot evaluation of GPT-3.5 and multiple open LLMs reveals strong performance on Level 0 but substantial gaps at Levels 1-2, underscoring the need for few-shot or top-down strategies and future methodological developments. The benchmark provides a practical, extensible resource for advancing subjective NLP tasks in fallacy detection and broader argumentation research.

Abstract

We introduce MAFALDA, a benchmark for fallacy classification that merges and unifies previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset, together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models in a zero-shot setting, as well as human performance, on MAFALDA to assess their capability to detect and classify fallacies.

Paper Structure (66 sections, 4 theorems, 32 equations, 12 figures, 22 tables)


Introduction
Related Work
Datasets
Subjectivity and Annotation Challenges
Taxonomies of Fallacies
A Unified Taxonomy of Fallacies
Definitions
Taxonomy of Fallacies
Tackling Subjectivity in Annotations
Subjectivity in Fallacy Annotation
Disjunctive Annotation Scheme
Evaluation Metrics
MAFALDA Dataset
Source Datasets
Annotation
...and 51 more sections

Key Result

Proposition I.1

Given a gold standard $G$, where each span comprises only a single sentence, and where each fallacy set contains only one element, and given a prediction $P$, where each span comprises only a single sentence, our precision coincides with the standard precision.
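
The proposition can be checked on a toy case. The sketch below is not the paper's implementation: it assumes that the comparison score $C(p, l_p, g, l_g, h)$ is the overlap between the predicted span $p$ and the gold span $g$, divided by $h$, and set to zero when the predicted label $l_p$ is not in the gold label set $l_g$. It further assumes that precision averages the best score of each prediction with $h = |p|$. The function names, the sentence-index span encoding, and the example labels are all hypothetical.

```python
from typing import FrozenSet, List, Tuple

Span = Tuple[int, int]               # half-open [start, end), counted in sentences here
Prediction = Tuple[Span, str]        # one label per predicted span
Gold = Tuple[Span, FrozenSet[str]]   # disjunctive annotation: a set of acceptable labels


def overlap(a: Span, b: Span) -> int:
    """Number of units shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def score(p: Span, l_p: str, g: Span, l_g: FrozenSet[str], h: int) -> float:
    """ASSUMED form of C(p, l_p, g, l_g, h): overlap / h, gated by label membership."""
    if h == 0 or l_p not in l_g:
        return 0.0
    return overlap(p, g) / h


def span_precision(P: List[Prediction], G: List[Gold]) -> float:
    """Average, over predictions, of the best score against any gold span (h = |p|)."""
    if not P:
        return 0.0  # convention chosen for this sketch only
    total = 0.0
    for p, l_p in P:
        total += max((score(p, l_p, g, l_g, p[1] - p[0]) for g, l_g in G), default=0.0)
    return total / len(P)


def standard_precision(P: List[Prediction], G: List[Gold]) -> float:
    """Fraction of predictions that exactly match a gold (span, label) pair.
    Only meaningful when every gold label set is a singleton."""
    if not P:
        return 0.0
    gold_pairs = {(g, next(iter(l_g))) for g, l_g in G}
    return sum(1 for p, l_p in P if (p, l_p) in gold_pairs) / len(P)


# Made-up data satisfying the proposition's premises:
# single-sentence spans, singleton gold label sets.
gold = [((0, 1), frozenset({"ad hominem"})), ((2, 3), frozenset({"strawman"}))]
pred = [((0, 1), "ad hominem"), ((1, 2), "strawman"), ((2, 3), "ad hominem")]

print(span_precision(pred, gold), standard_precision(pred, gold))
```

Under these assumptions, a single-sentence prediction either coincides with a gold sentence or does not overlap it at all, so every score is 0 or 1 and the two functions return the same value. With longer spans the score becomes fractional, which is where a span-based metric departs from the standard one.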

Figures (12)

Figure 1: Examples of Fallacies. The spans of the fallacies are underlined. Example \ref{ex:implicit} is from Jin et al. (2022), \ref{ex:political} from Goffredo et al. (2022), and \ref{ex:finance_law_degree} from Sahai et al. (2021). Detailed annotations are in Appendix \ref{app:additional_examples}.
Figure 2: Tree structure of our taxonomy. Detailed definitions of the fallacies are in Appendix \ref{app:informal_def}.
Figure 3: List of fallacies per paper and in our taxonomy.
Figure 4: Statistics about our source datasets. The left graphic shows the vocabulary size, while the right graphic shows the average length of the texts.
Figure 5: Co-occurrence of labels (frequency)
...and 7 more figures

Theorems & Definitions (4)

Proposition I.1
Proposition I.2
Proposition I.3
Proposition I.4
