Self-Contradictory Reasoning Evaluation and Detection

Ziyi Liu; Soumya Sanyal; Isabelle Lee; Yongkang Du; Rahul Gupta; Yang Liu; Jieyu Zhao

TL;DR

Current LLMs lack the robustness necessary for reliable reasoning, and the results underscore the urgent need to establish best practices for comprehensive reasoning evaluation beyond pure performance-based metrics.

Abstract

A plethora of recent work has shown that large language models (LLMs) have impressive reasoning ability, but many proposed downstream reasoning tasks focus only on final answers. Two fundamental questions persist: 1) how consistent is the reasoning, and 2) can models detect unreliable reasoning? In this paper, we investigate self-contradictory (Self-Contra) reasoning, where the model's reasoning does not support its answers. To answer 1), we define and assess the Self-Contra rate across three datasets and delve into finer-grained categories of Self-Contra reasoning. We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense. The model may generate correct answers by taking shortcuts in reasoning or overlooking contextual evidence, leading to compromised reasoning. For 2), we task the state-of-the-art model GPT-4 with identifying Self-Contra reasoning and finer-grained fallacies. We find that detection enhanced with finer-grained categories can improve GPT-4's ability to detect Self-Contra. However, GPT-4 detects Self-Contra with only a 52.2% F1 score, much lower than the 66.7% achieved by humans. Our results indicate that current LLMs lack the robustness necessary for reliable reasoning, and we emphasize the urgent need for establishing best practices in comprehensive reasoning evaluations beyond pure performance-based metrics.
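
The detection results above are reported as F1 scores. As a minimal sketch of how such a score could be computed, the snippet below treats detection as a binary task in which the positive class is "this instance is Self-Contra". This setup is an assumption: the abstract does not specify the exact evaluation protocol, and the function name `f1_score` and the toy labels are hypothetical, chosen only for illustration.

```python
def f1_score(gold, pred):
    """Binary F1, where True marks an instance labeled Self-Contra.

    gold: reference labels (assumed here to come from human annotation)
    pred: labels produced by the detector being evaluated
    """
    tp = sum(g and p for g, p in zip(gold, pred))          # true positives
    fp = sum((not g) and p for g, p in zip(gold, pred))    # false positives
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # false negatives
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels (not data from the paper).
gold = [True, True, True, False, False, False, False, True]
pred = [True, False, False, True, False, False, False, True]
score = f1_score(gold, pred)
print(f"F1 = {score:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a detector that flags few true Self-Contra cases scores low even when most of its predictions on non-contradictory instances are correct.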

Paper Structure (41 sections, 2 equations, 7 figures, 16 tables)

Introduction
Self-Contra Reasoning
Definition
Dataset
Probing Reasoning in LLMs
Zero- and Few-shot prompting
Results and Analysis
Which tasks and LLMs are prone to formulate Self-Contra reasoning?
Does accuracy correlate with SCR?
Which are the most common reasoning?
Finer-grained Categories of Self-Contra
Correct Reasoning Categories
Wrong Reasoning Categories
Results
Automatic detection
...and 26 more sections

Figures (7)

Figure 1: An example of self-contradictory reasoning and its detection by LLMs. LLMs fail to generate consistent reasoning and are poor at detecting the self-contradiction.
Figure 2: Three paradigms we study: human-annotated Self-Contra reasoning evaluation, finer-grained category analysis, and automatic detection of Self-Contra enhanced with finer-grained categories. We first identify the type of Self-Contra reasoning and analyze the detailed causes of the issues. Then we build automatic evaluation based on finer-grained category detection.
Figure 3: Frequency of types in the WinoBias and WinoGrande datasets. The results for the WinoGender dataset are shown in the appendix. We combine zero-shot and few-shot results. Takeaway: Type 2 reasoning accounts for a large portion of Self-Contra, which could hurt users' confidence in LLMs because wrong reasoning yields correct answers.
Figure 4: Type results on the WinoGender dataset.
Figure 5: Introduction to the task for human detection.
...and 2 more figures
