When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models

Yanhong Li, Chenghao Yang, Allyson Ettinger

TL;DR

This work probes whether LLM self-reflection truly improves reasoning once external feedback and iterative prompting are removed. It introduces the Single-Round Self-Reflection Verification (SR$^2$V) framework, which generates $K$ candidate answers, critiques each, and revises them into a final output. Across TruthfulQA and HotpotQA, and across models including ChatGPT-3.5, LLaMA-2, and Mixtral, SR$^2$V yields mixed results: clear gains on TruthfulQA but lower performance on HotpotQA, with outcomes modulated by the model’s initial response accuracy ($RA$) and by human-annotated question difficulty. An error analysis with artificial responses shows that self-reflection helps mainly when initial answers are unreliable, and especially on harder questions; it also reduces reliance on majority voting, suggesting more nuanced decision-making. Based on these findings, the authors propose practical guidelines for when to enable self-reflection, keyed to estimates of $RA$ and question difficulty, and they release their code to support reproducibility. Overall, the usefulness of self-reflection depends on both the model and the task.
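The generate, critique, revise loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function name `single_round_self_reflection`, and the stand-in model `make_fake_llm` are all hypothetical, and the `llm` argument is assumed to be any callable that maps a prompt string to a completion string.

```python
from typing import Callable, List, Tuple


def single_round_self_reflection(
    question: str, llm: Callable[[str], str], k: int = 3
) -> str:
    """Generate k candidates, critique each once, then revise to one answer.

    No external feedback is consulted: the model never learns whether a
    candidate was correct, and there is exactly one reflection round.
    """
    # Step 1: sample k candidate answers (prompt text is a hypothetical example).
    candidates = [llm(f"Question: {question}\nAnswer:") for _ in range(k)]

    # Step 2: ask the model to critique each candidate independently.
    critiques = [
        llm(
            f"Question: {question}\nCandidate answer: {c}\n"
            "Critique this answer:"
        )
        for c in candidates
    ]

    # Step 3: show all candidates and critiques, then ask for a final answer.
    review = "\n".join(
        f"Candidate {i + 1}: {c}\nCritique {i + 1}: {q}"
        for i, (c, q) in enumerate(zip(candidates, critiques))
    )
    return llm(f"Question: {question}\n{review}\nFinal answer:")


def make_fake_llm() -> Tuple[Callable[[str], str], List[str]]:
    """Return a deterministic stand-in model plus a log of prompts it saw."""
    calls: List[str] = []

    def fake_llm(prompt: str) -> str:
        calls.append(prompt)
        if prompt.endswith("Critique this answer:"):
            return "Looks plausible."
        return "Paris"

    return fake_llm, calls
```

With `k` candidates the sketch issues `2k + 1` model calls (`k` generations, `k` critiques, one revision), which is the cost to weigh against any accuracy change from reflection.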

Abstract

Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models' initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval.
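One way to quantify a "tendency toward majority voting" is to check how often the final answer coincides with the most frequent candidate answer. The sketch below assumes that reading; the paper's exact metric may differ, and the function names and the lowercase/whitespace normalization are illustrative choices rather than details from the source.

```python
from collections import Counter
from typing import Sequence


def majority_answer(candidates: Sequence[str]) -> str:
    """Return the most frequent candidate after simple normalization."""
    counts = Counter(a.strip().lower() for a in candidates)
    # Ties are broken in favor of the answer that appeared first.
    return counts.most_common(1)[0][0]


def majority_agreement_rate(
    candidate_sets: Sequence[Sequence[str]], finals: Sequence[str]
) -> float:
    """Fraction of questions whose final answer equals the majority candidate."""
    if len(candidate_sets) != len(finals):
        raise ValueError("need one final answer per candidate set")
    if not finals:
        return 0.0
    hits = sum(
        majority_answer(cands) == final.strip().lower()
        for cands, final in zip(candidate_sets, finals)
    )
    return hits / len(finals)
```

Under this reading, a lower agreement rate with reflection than without it would indicate that the revision step is doing more than picking the most common candidate.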

Paper Structure (32 sections, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Example of Self-Reflection Prompting
  • Figure 2: Performance Decomposition on Question Difficulty and Response Accuracy.
  • Figure 3: Performance Decomposition on Question Difficulty and Response Accuracy (Artificial Responses). Dotted lines show "turning points" at which reflection loses effectiveness, for Easy/Medium/Hard questions.
  • Figure 4: Majority Voting Analysis
  • Figure 5: Proposed guide for using Self-Reflection.
  • ...and 6 more figures