When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models

Yinghui Li; Qingyu Zhou; Yuanzhen Luo; Shirong Ma; Yangning Li; Hai-Tao Zheng; Xuming Hu; Philip S. Yu

TL;DR

This paper introduces FLUB, a FaLlacy Understanding Benchmark that stress-tests large language models on their ability to understand cunning, misleading, and humorous texts drawn from real online content. FLUB comprises 834 Chinese samples across 8 cunning types and 3 tasks (Answer Selection, Cunning Type Classification, and Fallacy Explanation), with both automatic and human evaluation protocols. Experimental results show that most advanced LLMs struggle with fallacy understanding, especially in cunning type classification, and that Chain-of-Thought prompts do not consistently improve performance; in-context learning can help when sufficient demonstrations are provided. The work highlights a clear performance gap between humans and LLMs on fallacies and argues for continued research to strengthen LLMs’ resilience to real-world misleading language. The dataset and prompts are available at https://github.com/THUKElab/FLUB.

Abstract

Recently, Large Language Models (LLMs) have made remarkable progress in language understanding and generation. Following this, various benchmarks for measuring all kinds of LLM capabilities have sprung up. In this paper, we challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of tricky, humorous, and misleading texts collected from the real internet environment. We design three tasks of increasing difficulty in FLUB to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs, and the results show that FLUB is challenging and worthy of further study. Our extensive experiments and detailed analyses yield interesting discoveries and valuable insights. We hope that our benchmark can encourage the community to improve LLMs' ability to understand fallacies. Our data and code are available at https://github.com/THUKElab/FLUB.
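
To make the task setup concrete, the following is a minimal sketch of how the two classification-style tasks (Answer Selection and Cunning Type Classification) could be scored. It assumes plain exact-match accuracy as the metric, which this page does not confirm. The `CunningSample` record, its field names, and the `choose` callback are hypothetical illustrations and are not taken from the FLUB repository.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CunningSample:
    """Hypothetical record layout; the released FLUB files may use other field names."""
    text: str                # the cunning text shown to the model
    options: Sequence[str]   # candidate answers for Answer Selection
    answer_index: int        # index of the correct option
    cunning_type: str        # gold label for Cunning Type Classification


def accuracy(predictions: Sequence, golds: Sequence) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def evaluate_answer_selection(
    samples: Sequence[CunningSample],
    choose: Callable[[str, Sequence[str]], int],
) -> float:
    """Score a model wrapper `choose(text, options) -> option index` (assumed interface)."""
    predictions = [choose(s.text, s.options) for s in samples]
    return accuracy(predictions, [s.answer_index for s in samples])
```

In this sketch, `choose` would wrap whatever model call and answer parsing are used. Cunning Type Classification can reuse `accuracy` directly by comparing predicted type labels against `cunning_type`. Fallacy Explanation is free-form generation and is not covered here.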

Paper Structure (95 sections, 6 figures, 3 tables)

Introduction
The FLUB Benchmark
Benchmark Construction
Data Collection
Data Cleaning
Data Annotation
Dataset Analysis
Data Size
Data Distribution
Annotation Quality
Benchmark Task Setups
Task 1: Answer Selection
Task 2: Cunning Type Classification
Task 3: Fallacy Explanation
Automatic Evaluation Metrics
...and 80 more sections

Figures (6)

Figure 1: The running examples and annotation examples of FLUB.
Figure 2: The definitions and examples of the cunning types in FLUB.
Figure 3: The results of in-context learning with 0/1/2/5-shots demonstrations.
Figure 4: Our designed prompts without the Chain-of-Thought idea. Task 3(a) is for the texts that are not expressed in the form of inquiries. Task 3(b) is for inquiries.
Figure 5: Our designed prompts with the Chain-of-Thought idea. Task 3(a) is for the texts that are not expressed in the form of inquiries. Task 3(b) is for inquiries.
...and 1 more figure
