When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
Yanhong Li, Chenghao Yang, Allyson Ettinger
TL;DR
This work probes whether LLM self-reflection truly improves reasoning when external feedback and iterative prompting are removed. It introduces the Single-Round Self-Reflection Verification (SR$^2$V) framework, which generates $K$ candidate answers, critiques each, and revises them into a final output. Across TruthfulQA and HotpotQA, and across models including ChatGPT-3.5, LLaMA-2, and Mixtral, SR$^2$V yields mixed results: clear gains on TruthfulQA but degraded performance on HotpotQA, with outcomes modulated by the model's initial response accuracy ($RA$) and by human-annotated question difficulty. An error analysis with artificial responses shows that self-reflection helps mainly when initial answers are unreliable, and especially on harder questions; it also reduces reliance on majority voting, which suggests more nuanced decision-making. Based on these findings, the authors propose practical guidelines for when to enable self-reflection, built on estimates of $RA$ and question difficulty, and they release their code to support reproducibility. The results point to model- and task-specific guidance for deploying self-reflection.
Abstract
Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance on TruthfulQA, it adversely affects results on HotpotQA. We conduct follow-up analyses to clarify the factors contributing to these patterns, and find that the effect of self-reflection depends both on how reliably accurate models' initial responses are and on overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces the tendency toward majority voting. Based on our findings, we propose guidelines for deciding when to use self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval.
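As a rough illustration of the single-round procedure summarized above (generate $K$ candidates, critique each, revise once, with no external feedback), the following minimal Python sketch may help. It is not the authors' released implementation: the function names and the `generate`, `critique`, and `revise` callables are hypothetical stand-ins for model calls, and `majority_vote` is included only as the baseline that self-reflection is compared against.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(candidates: List[str]) -> str:
    """Baseline: return the most frequent candidate (ties go to the earliest seen)."""
    return Counter(candidates).most_common(1)[0][0]


def single_round_reflection(
    question: str,
    generate: Callable[[str], str],                     # hypothetical: one model sample
    critique: Callable[[str, str], str],                # hypothetical: model critiques one answer
    revise: Callable[[str, List[str], List[str]], str], # hypothetical: model writes final answer
    k: int = 3,
) -> str:
    """One pass of generate -> critique -> revise, with no external feedback or looping."""
    # Step 1: sample k candidate answers independently.
    candidates = [generate(question) for _ in range(k)]
    # Step 2: produce one critique per candidate.
    critiques = [critique(question, c) for c in candidates]
    # Step 3: a single revision step sees all candidates and critiques.
    return revise(question, candidates, critiques)
```

With stubbed callables, the reflective path can return a minority answer that `majority_vote` would have discarded, which is the kind of departure from majority voting the abstract describes. Whether that departure helps depends, per the findings above, on initial accuracy and question difficulty.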
