Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral

Fengyuan Liu, Nouar AlDahoul, Gregory Eady, Yasir Zaki, Talal Rahwan

TL;DR

Self-reflection in LLMs shows mixed effects on reasoning: outcomes depend heavily on the prompts and the model, and self-reflection often fails to surpass standard chain-of-thought prompting. Strikingly, self-reflection yields substantial improvements in safety, gender neutrality, and ideological neutrality across proprietary models (GPT-4o-mini, Gemini-1.5-Flash) and open models (Llama-3.2-3B) when evaluated on dedicated datasets and task prompts. The study also demonstrates cross-reflection, where one model critiques another, yielding further gains in safety and neutrality while preserving usefulness. The authors publicly release their datasets and code and present test-time, multi-objective reflection as a practical deployment strategy for safer, fairer LLMs.

Abstract

Previous studies proposed that the reasoning capabilities of large language models (LLMs) can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, earlier experiments offer mixed results when it comes to the benefits of self-reflection. Furthermore, prior studies on self-reflection are predominantly concerned with the reasoning capabilities of models, ignoring the potential for self-reflection in safety, bias, and ideological leaning. Here, by conducting a series of experiments testing LLMs' self-reflection capability in various tasks using a variety of prompts and different LLMs, we make several contributions to the literature. First, we reconcile conflicting findings regarding the benefit of self-reflection by demonstrating that the outcome of self-reflection is sensitive to prompt wording: both the original prompt that is used to elicit an initial answer and the subsequent prompt used to self-reflect. Specifically, although self-reflection may improve the reasoning capability of LLMs when the initial response is simple, the technique cannot improve upon the state-of-the-art chain-of-thought (CoT) prompting. Second, we show that self-reflection can lead to safer (75.8% reduction in toxic responses while preserving 97.8% of non-toxic ones), less biased (77% reduction in gender-biased responses while preserving 94.3% of unbiased ones), and more ideologically neutral responses (100% reduction in partisan-leaning responses while preserving 87.7% of non-partisan ones). The paper concludes by discussing the implications of our findings on the deployment of large language models. We release our experiments at https://github.com/Michael98Liu/self-reflection.
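The "reduction" and "preservation" percentages quoted in the abstract can be read as a true positive rate (TPR) and a true negative rate (TNR), the two quantities plotted in the safety, gender bias, and ideological leaning figures. The following is a minimal sketch of how such rates might be computed. It assumes that each response carries a ground-truth label and that self-reflection produces a flag for each response. The function name `reflection_rates`, its arguments, and the toy data are hypothetical and are not taken from the authors' released code.

```python
from typing import Sequence, Tuple


def reflection_rates(
    is_problematic: Sequence[bool],  # assumed ground-truth label per response
    flagged: Sequence[bool],         # assumed: True if self-reflection flagged it
) -> Tuple[float, float, float]:
    """Return (TPR, TNR, accuracy) for a set of reflected responses."""
    if len(is_problematic) != len(flagged):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(is_problematic, flagged))
    n_pos = sum(1 for p, _ in pairs if p)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one problematic and one benign response")

    # Problematic responses that reflection caught.
    tp = sum(1 for p, f in pairs if p and f)
    # Benign responses that reflection left alone.
    tn = sum(1 for p, f in pairs if not p and not f)

    return tp / n_pos, tn / n_neg, (tp + tn) / len(pairs)


# Toy data for illustration only; these are not results from the paper.
labels = [True, True, True, True, False, False, False, False]
flags = [True, True, True, False, False, False, False, True]

tpr, tnr, acc = reflection_rates(labels, flags)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} accuracy={acc:.2f}")
```

Under this reading, a high TPR corresponds to a large reduction in problematic responses, and a high TNR corresponds to preserving most acceptable ones. The paper's exact evaluation protocol may differ.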

Paper Structure (17 sections, 16 figures, 16 tables)

Figures (16)

  • Figure 1: Accuracy on the MEDQA-USMLE dataset before and after self-reflection. The accuracy before self-reflection is denoted "original answer"; the remaining values are accuracies after self-reflection using one of the five prompts (listed in an Appendix figure). The left panel corresponds to self-reflection on simple initial responses obtained without CoT, while the right panel corresponds to self-reflection on initial responses obtained with CoT. The temperature is set to 1 for text generation.
  • Figure 2: TPR, TNR, and overall accuracy after self-reflection on the safety dataset. The temperature is set to 1 for text generation. The seven prompts used in the experiments are listed in an Appendix figure.
  • Figure 3: TPR, TNR, and overall accuracy after self-reflection on the gender bias dataset. The temperature is set to 1 for text generation. The four prompts used in the experiments are listed in an Appendix figure.
  • Figure 4: TPR, TNR, and overall accuracy after self-reflection on the ideological leaning dataset. The temperature is set to 1 for text generation. The four prompts used in the experiments are listed in an Appendix figure.
  • Figure 5: The percentage of initial answers that are changed after self-reflection using GSM8K. The x-axis shows the percentage among answers that are initially wrong, while the y-axis shows the percentage among those that are initially correct.
  • ...and 11 more figures