Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives
Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu
TL;DR
This paper tackles the unreliability of intrinsic self-reflection in LLMs by identifying biased self-evaluation as the main bottleneck. It introduces Self-Contrast, a three-step process that generates diverse solving perspectives, contrasts their differences, and produces a discrepancy-based checklist to guide self-correction. Empirical results across math reasoning and creative translation show significant and consistent improvements over vanilla reflection and competitive baselines, with strong generalization across GPT-3.5, GPT-4, and Llama-2 models. The work demonstrates that contrastive, multi-perspective reflection can yield more accurate and stable self-correction, offering a scalable approach to enhance reasoning in LLMs without external feedback.
Abstract
The reflection capacity of Large Language Models (LLMs) has garnered extensive attention. Post-hoc prompting strategies, e.g., Reflexion and Self-Refine, refine an LLM's response based on self-evaluated or external feedback. However, recent research indicates that, without external feedback, an LLM's intrinsic reflection is unstable. Our investigation reveals that the key bottleneck is the quality of the self-evaluated feedback. We find that LLMs often exhibit overconfidence or high randomness when they self-evaluate, offering stubborn or inconsistent feedback, which leads to poor reflection. To remedy this, we advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist that can be used to re-examine and eliminate them. Our method endows the LLM with diverse perspectives to alleviate stubborn biases. Moreover, the discrepancies among these perspectives indicate potential errors or inherent uncertainties that the LLM often overlooks. Reflecting upon these can catalyze more accurate and stable reflection. Experiments on a series of reasoning and translation tasks with different LLMs underscore the effectiveness and generality of our strategy.
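The three steps named in the abstract (explore perspectives, contrast them, summarize a checklist) can be sketched as a small prompt pipeline. This is a minimal illustration, not the authors' implementation: the `llm` callable, the function name `self_contrast`, the prompt wording, the choice to contrast every pair of solutions, and the use of the first solution as the draft to revise are all assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def self_contrast(request: str, llm: LLM, n_perspectives: int = 3) -> str:
    """Sketch of a Self-Contrast style loop (prompts are illustrative only)."""
    # Step 1: ask the model for several distinct solving perspectives,
    # then solve the request once from each perspective.
    perspectives: List[str] = [
        llm(f"Propose solving perspective #{i + 1} for this request:\n{request}")
        for i in range(n_perspectives)
    ]
    solutions: List[str] = [
        llm(f"Perspective: {p}\nSolve the request from this perspective:\n{request}")
        for p in perspectives
    ]

    # Step 2: contrast the solutions pairwise and describe where they differ.
    # (Assumption: all pairs are compared; a real system might select a subset.)
    discrepancies: List[str] = [
        llm(f"Contrast these two solutions and list their differences:\nA: {a}\nB: {b}")
        for a, b in combinations(solutions, 2)
    ]

    # Step 3: summarize the discrepancies into a checklist for re-examination.
    checklist = llm(
        "Summarize these discrepancies into a checklist of items to re-examine:\n"
        + "\n".join(discrepancies)
    )

    # Reflection: revise a draft solution using the checklist.
    # (Assumption: the first solution serves as the draft.)
    return llm(
        f"Request: {request}\nDraft: {solutions[0]}\n"
        f"Checklist:\n{checklist}\nRevise the draft so every checklist item is resolved."
    )
```

Passing a real model client as `llm` is left to the reader. With the default `n_perspectives=3`, the sketch issues eleven model calls per request, so the number of perspectives is the main cost knob.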
