Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu

TL;DR

This paper tackles the unreliability of intrinsic self-reflection in LLMs by identifying biased self-evaluation as the main bottleneck. It introduces Self-Contrast, a three-step process that generates diverse solving perspectives, contrasts their differences, and produces a discrepancy-based checklist to guide self-correction. Empirical results across math reasoning and creative translation show significant and consistent improvements over vanilla reflection and competitive baselines, with strong generalization across GPT-3.5, GPT-4, and Llama-2 models. The work demonstrates that contrastive, multi-perspective reflection can yield more accurate and stable self-correction, offering a scalable approach to enhance reasoning in LLMs without external feedback.

Abstract

The reflection capacity of Large Language Models (LLMs) has garnered extensive attention. Post-hoc prompting strategies, e.g., Reflexion and Self-Refine, refine an LLM's response based on self-evaluated or external feedback. However, recent research indicates that without external feedback, an LLM's intrinsic reflection is unstable. Our investigation reveals that the key bottleneck is the quality of the self-evaluated feedback. We find that LLMs often exhibit overconfidence or high randomness when self-evaluating, offering stubborn or inconsistent feedback, which causes poor reflection. To remedy this, we advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist that can be used to re-examine and eliminate them. Our method endows LLMs with diverse perspectives to alleviate stubborn biases. Moreover, the discrepancies indicate potential errors or inherent uncertainties that LLMs often overlook. Reflecting upon these can catalyze more accurate and stable reflection. Experiments on a series of reasoning and translation tasks with different LLMs underscore the effectiveness and generality of our strategy.
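
The flow the abstract describes (explore perspectives, contrast them, summarize a checklist, revise) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, the function names, and the `difflib`-based similarity filter are all hypothetical stand-ins, and the final step is simplified to a single revision call.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List

# Hypothetical interface (an assumption): a prompt goes in, a completion comes out.
LLM = Callable[[str], str]


def select_distinct(responses: List[str], max_similarity: float = 0.8) -> List[str]:
    """Keep only responses that differ noticeably from every response already kept.

    String similarity is a stand-in here; the source does not specify the filter.
    """
    kept: List[str] = []
    for r in responses:
        if all(SequenceMatcher(None, r, k).ratio() < max_similarity for k in kept):
            kept.append(r)
    return kept


def self_contrast(request: str, llm: LLM, n_perspectives: int = 3) -> str:
    # Step 1: ask the model for diverse solving perspectives, one per line.
    raw = llm(f"List {n_perspectives} different ways to solve this, one per line:\n{request}")
    perspectives = [p.strip() for p in raw.splitlines() if p.strip()][:n_perspectives]

    # Step 2: produce one candidate response per perspective.
    responses = [llm(f"Solve using this perspective: {p}\n{request}") for p in perspectives]

    # Step 3: drop near-duplicate responses so only differing ones are contrasted.
    distinct = select_distinct(responses)
    if len(distinct) < 2:
        return distinct[0] if distinct else ""  # nothing to contrast

    # Step 4: contrast each pair, then summarize the discrepancies into a checklist.
    differences = [
        llm(f"Contrast these two solutions and list their differences.\nA: {a}\nB: {b}")
        for a, b in combinations(distinct, 2)
    ]
    checklist = llm("Summarize these discrepancies into a checklist:\n" + "\n".join(differences))

    # Step 5: re-examine the candidates against the checklist and revise.
    candidates = "\n---\n".join(distinct)
    return llm(
        f"Request: {request}\nCandidates:\n{candidates}\nChecklist:\n{checklist}\n"
        "Re-examine the candidates and give one consistent final answer."
    )
```

Any function that maps a prompt string to a completion string can be passed as `llm`, so the sketch can be exercised with a stub before wiring in a real model client.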

Paper Structure (38 sections, 6 figures, 8 tables)

This paper contains 38 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: LLMs evaluate the initial response and provide feedback for revision. However, most erroneous responses remain uncorrected after reflection as the feedback is either overconfident (46.7%) or inconsistent (45.7%). Bottom: Self-Contrast explores multiple solving perspectives, contrasts their differences, and summarizes them into an insightful checklist for self-correction.
  • Figure 2: Self-Contrast designs diverse prompts for different solving perspectives and generates corresponding results. Then we filter out similar results and select those that are significantly different. To inspire reflection, we contrast the differences between the selected results and prompt the LLM to summarize a checklist. This checklist can be used to re-examine and eliminate discrepancies. Lastly, the LLM revises each response to achieve a consistent result.
  • Figure 3: Left: The distribution of the number of prompts generated during self-curation. Right: The top-20 keywords in the prompt names and their frequencies.
  • Figure A1: Reflection accuracy with different LLMs providing the initial response. Left: different LLMs provide initial responses when GPT-3.5 is used for evaluation and revision. Center: different LLMs provide initial responses when Llama2-70B is used for evaluation and revision. Right: different LLMs provide initial responses when Llama2-13B is used for evaluation and revision. The results indicate that LLMs are easily influenced during reflection: they are predisposed to trust previous responses over diligently examining and correcting errors.
  • Figure A2: We replace the self-curated prompt process with a simple strategy: directly sampling the top-n responses for contrast. We observe that performance improves as n increases, yet it remains lower than Self-Contrast with the self-curated prompts. All results are obtained on GSM8K using GPT-3.5.
  • ...and 1 more figure