When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models

Yanhong Li; Chenghao Yang; Allyson Ettinger

TL;DR

This work probes whether LLM self-reflection truly enhances reasoning when external feedback and iterative prompting are removed. It introduces the Single-Round Self-Reflection Verification (SR$^2$V) framework, which generates $K$ candidate answers, critiques each, and revises them into a final output. Across TruthfulQA and HotpotQA, and across multiple models including ChatGPT-3.5, LLaMA-2, and Mixtral, SR$^2$V yields mixed results: clear gains on TruthfulQA but detrimental effects on HotpotQA, with outcomes modulated by the model’s initial response accuracy ($RA$) and human-annotated question difficulty. An error analysis with artificial responses shows that self-reflection is beneficial mainly when initial answers are unreliable, and especially for harder questions. Self-reflection also reduces reliance on majority voting, suggesting more nuanced decision-making. Based on these findings, the authors propose practical guidelines for when to enable self-reflection, driven by estimates of $RA$ and question difficulty, and they release their code to foster reproducibility. The results indicate that the usefulness of self-reflection depends on the model and the task.
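The generate, critique, and revise steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `single_round_reflection`, the `generate` callable (standing in for any LLM call), the prompt wording, and the default `k=4` are all assumptions made for this example.

```python
from typing import Callable


def single_round_reflection(
    question: str,
    generate: Callable[[str], str],  # hypothetical wrapper around an LLM call
    k: int = 4,                      # number of candidates; value is an assumption
) -> str:
    """Sample k candidates, critique each once, then revise to one final answer.

    No external feedback is consulted and the loop runs a single round.
    """
    # Step 1: draw k independent candidate answers.
    candidates = [generate(f"Question: {question}\nAnswer:") for _ in range(k)]

    # Step 2: ask the same model to critique each candidate.
    critiques = [
        generate(
            f"Question: {question}\nCandidate answer: {c}\nCritique this answer:"
        )
        for c in candidates
    ]

    # Step 3: show all candidates with their critiques and request one final answer.
    review = "\n".join(
        f"Candidate {i + 1}: {c}\nCritique {i + 1}: {r}"
        for i, (c, r) in enumerate(zip(candidates, critiques))
    )
    return generate(f"Question: {question}\n{review}\nFinal answer:")
```

With `k` candidates the sketch issues `2k + 1` model calls: `k` to answer, `k` to critique, and one to revise.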

Abstract

Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models' initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval.
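Two quantities from the abstract, the accuracy of a model's initial responses and its tendency toward majority voting, can be illustrated with a small sketch. The helper names below are hypothetical, and exact string matching against a gold answer is an assumption made for simplicity; the paper's own scoring may differ.

```python
from collections import Counter
from typing import List


def response_accuracy(candidates: List[str], gold: str) -> float:
    """Fraction of initial candidates that match the gold answer (exact match assumed)."""
    return sum(c == gold for c in candidates) / len(candidates)


def majority_vote(candidates: List[str]) -> str:
    """Most frequent candidate; ties go to the answer seen first."""
    return Counter(candidates).most_common(1)[0][0]


def follows_majority(candidates: List[str], final_answer: str) -> bool:
    """True when the post-reflection answer coincides with the majority candidate."""
    return final_answer == majority_vote(candidates)
```

Averaging `follows_majority` over a dataset gives one simple way to check whether reflected answers merely echo the most common candidate.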

Paper Structure (32 sections, 11 figures, 4 tables)


Introduction
Self-Reflection Prompting
Preliminary Study: Does Self-Reflection Prompting Work Under SR$^2$V?
Experiment Setup
Observations
Why Self-Reflection May Not Work?
Error Analysis via Artificial Response
Effects on majority voting
Discussion
Conclusion
Accuracy Decomposition over 4 responses
Artificial Response Generation
Conditional Prompting Results
Evaluation details for TruthfulQA
Prompts used in Experiment
...and 17 more sections

Figures (11)

Figure 1: Example of Self-Reflection Prompting
Figure 2: Performance Decomposition on Question Difficulty and Response Accuracy.
Figure 3: Performance Decomposition on Question Difficulty and Response Accuracy (Artificial Responses). Dotted lines show "turning points" at which reflection loses effectiveness, for Easy/Medium/Hard questions.
Figure 4: Majority Voting Analysis
Figure 5: Proposed guide for using Self-Reflection.
...and 6 more figures
