It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning

Nishant Balepur, Shramay Palta, Rachel Rudinger

TL;DR

This paper investigates whether large language models can reason toward incorrect options using process of elimination (PoE) combined with chain-of-thought (COT) prompting on two-choice multiple-choice question answering (MCQA). It benchmarks GPT-3.5, LLaMA-2, and Falcon on four commonsense and scientific reasoning datasets and finds that PoE with COT consistently underperforms the direct answer (DA) strategy, and that the two strategies agree with each other less often than each strategy agrees with itself. Error analyses attribute PoE COT failures mainly to reasoning mistakes and misaligned rationales, with negation proving particularly challenging; iterative PoE also exhibits error propagation, which undermines its reliability. The study presents PoE with COT as a potential interpretability and diagnostic tool but concludes that it is not yet robust enough for practical deployment. Suggested future work includes combining DA and PoE reasoning, benchmarking robustness, and selectively fine-tuning PoE rationales to improve performance and reliability.

Abstract

Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.
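
As a rough illustration of how the two strategies can be compared on two-choice questions, the sketch below converts a PoE prediction (the option the model eliminates) into its implied answer, then computes accuracy and DA/PoE agreement. The function names and the toy data are hypothetical and are not taken from the paper; this is a minimal sketch under those assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def implied_answer(eliminated: int) -> int:
    """With two options (0 and 1), eliminating one leaves the other as the answer."""
    if eliminated not in (0, 1):
        raise ValueError("two-choice questions only have options 0 and 1")
    return 1 - eliminated


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def agreement(da_answers: Sequence[int], poe_eliminated: Sequence[int]) -> float:
    """Fraction of questions where DA picks the option that PoE leaves standing."""
    poe_answers = [implied_answer(e) for e in poe_eliminated]
    return accuracy(da_answers, poe_answers)


# Hypothetical toy data: gold labels, DA picks, and the options PoE eliminated.
gold = [0, 1, 1, 0]
da_answers = [0, 1, 0, 0]
poe_eliminated = [1, 0, 1, 0]

da_acc = accuracy(da_answers, gold)
poe_acc = accuracy([implied_answer(e) for e in poe_eliminated], gold)
da_poe_agreement = agreement(da_answers, poe_eliminated)
print(da_acc, poe_acc, da_poe_agreement)
```

Scoring PoE through its implied answer puts both strategies on the same scale, so their accuracies and their agreement rate can be compared directly.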

Paper Structure (27 sections, 13 figures, 17 tables)

Figures (13)

  • Figure 1: ChatGPT using direct answer and process of elimination strategies via chain-of-thought prompting.
  • Figure 2: Accuracy of Direct Answer and Process of Elimination, with and without chain-of-thought, on commonsense (CQA, SIQA) and scientific (ARC, OpenBookQA) reasoning datasets. Numerical results are in the appendix.
  • Figure 3: Error distribution of PoE COT on ARC/CQA.
  • Figure 4: Accuracy of iterative PoE with each iteration.
  • Figure 5: Error distribution of PoE COT and DA COT on ARC and Commonsense QA.
  • ...and 8 more figures