Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral

Fengyuan Liu; Nouar AlDahoul; Gregory Eady; Yasir Zaki; Talal Rahwan

TL;DR

Self-reflection in LLMs has mixed effects on reasoning: outcomes depend heavily on the prompts and the model used, and self-reflection often fails to surpass standard chain-of-thought prompting. In contrast, self-reflection yields substantial improvements in safety, gender neutrality, and ideological neutrality across proprietary models (GPT-4o-mini, Gemini-1.5-Flash) and open models (Llama-3.2-3B) when evaluated on dedicated datasets and task prompts. The study also examines cross-reflection, where one model critiques another, and reports further gains in safety and neutrality while preserving usefulness. The authors release their datasets and code publicly and present test-time reflection as a practical deployment strategy for safer, fairer LLMs.

Abstract

Previous studies proposed that the reasoning capabilities of large language models (LLMs) can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, earlier experiments offer mixed results when it comes to the benefits of self-reflection. Furthermore, prior studies on self-reflection are predominantly concerned with the reasoning capabilities of models, ignoring the potential for self-reflection in safety, bias, and ideological leaning. Here, by conducting a series of experiments testing LLMs' self-reflection capability in various tasks using a variety of prompts and different LLMs, we make several contributions to the literature. First, we reconcile conflicting findings regarding the benefit of self-reflection by demonstrating that the outcome of self-reflection is sensitive to prompt wording, both the original prompt that is used to elicit an initial answer and the subsequent prompt used to self-reflect. Specifically, although self-reflection may improve the reasoning capability of LLMs when the initial response is simple, the technique cannot improve upon the state-of-the-art chain-of-thought (CoT) prompting. Second, we show that self-reflection can lead to safer (75.8% reduction in toxic responses while preserving 97.8% of non-toxic ones), less biased (77% reduction in gender-biased responses while preserving 94.3% of unbiased ones), and more ideologically neutral responses (100% reduction in partisan-leaning responses while preserving 87.7% of non-partisan ones). The paper concludes by discussing the implications of our findings for the deployment of large language models. We release our experiments at https://github.com/Michael98Liu/self-reflection.
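The procedure described in the abstract, an initial answer followed by a reflection prompt, and the "reduction while preserving" figures can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the released repository: `Model` is a hypothetical prompt-in, text-out callable, the follow-up prompt layout is invented, and the two rates reflect one plausible reading of the reported percentages (the share of initially flagged responses that are no longer flagged, and the share of initially unflagged responses that stay unflagged).

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt string, returns generated text.
Model = Callable[[str], str]


def self_reflect(model: Model, question: str, reflection_prompt: str) -> Tuple[str, str]:
    """Run one round of self-reflection and return (initial, revised) answers.

    The follow-up prompt layout is an illustrative assumption, not the
    wording used in the paper's experiments.
    """
    initial = model(question)
    followup = (
        f"Question: {question}\n"
        f"Initial answer: {initial}\n"
        f"{reflection_prompt}"
    )
    revised = model(followup)
    return initial, revised


def reduction_and_preservation(before: List[bool], after: List[bool]) -> Tuple[float, float]:
    """Compute (reduction, preservation) from per-response flags.

    before[i] / after[i] are True when response i is flagged (for example,
    judged toxic) before / after reflection. Reduction is the fraction of
    initially flagged responses that are no longer flagged; preservation is
    the fraction of initially unflagged responses that remain unflagged.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    flagged_after = [a for b, a in zip(before, after) if b]
    clean_after = [a for b, a in zip(before, after) if not b]
    if not flagged_after or not clean_after:
        raise ValueError("need at least one flagged and one unflagged response")
    reduction = sum(not a for a in flagged_after) / len(flagged_after)
    preservation = sum(not a for a in clean_after) / len(clean_after)
    return reduction, preservation
```

Because `Model` is just a callable, the same function covers cross-reflection in principle: a second callable could be passed for the reflection step, though that variant is not sketched here.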

Paper Structure (17 sections, 16 figures, 16 tables)

This paper contains 17 sections, 16 figures, 16 tables.

Introduction
Background and Related Work
Data and Experiments
Reasoning Datasets
Safety Dataset
Gender Bias Dataset
Ideological Leaning Dataset
Experiment Setup
Evaluations
Self-Reflection Marginally Improves Reasoning Capability
Self-Reflection Improves Safety
Self-Reflection Reduces Gender Bias
Self-Reflection Improves Partisan Neutrality
Cross-Reflection
Summary of Evaluations
...and 2 more sections

Figures (16)

Figure 1: Accuracy on the MEDQA-USMLE dataset before and after self-reflection. The accuracy before self-reflection is denoted "original answer"; the remaining values are accuracies after self-reflection using one of the five prompts (Appendix Figure \ref{fig:prompt-reasoning}). The left panel shows self-reflection on initial responses obtained without CoT, while the right panel shows self-reflection on initial responses obtained with CoT. The temperature is set to 1 for text generation.
Figure 2: TPR, TNR, and overall accuracy after self-reflection on the safety dataset. The temperature is set to 1 for text generation. See Appendix Figure \ref{fig:prompt-safety} for the seven prompts used in the experiments.
Figure 3: TPR, TNR, and overall accuracy after self-reflection on the gender bias dataset. The temperature is set to 1 for text generation. See Appendix Figure \ref{fig:prompt-gender} for the four prompts used in the experiments.
Figure 4: TPR, TNR, and overall accuracy after self-reflection on the ideological leaning dataset. The temperature is set to 1 for text generation. See Appendix Figure \ref{fig:prompt-ideol} for the four prompts used in the experiments.
Figure 5: The percentage of initial answers that are changed after self-reflection using GSM8K. The x-axis shows the percentage among answers that are initially wrong, while the y-axis shows the percentage among those that are initially correct.
...and 11 more figures
