Self-correction is Not An Innate Capability in Large Language Models

Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, Kristen Marie Johnson

TL;DR

The paper investigates whether moral self-correction is an innate capability of large language models (LLMs) by combining behavioral tests of moral sensitivity with mechanistic analyses of hidden states, using BBQ and RealToxicity benchmarks. It finds that LLMs are not morally sensitive and do not consistently leverage external feedback during self-correction; while both Chain-of-Thought (CoT) and external feedback can aid performance, their interaction often yields conflicts, undermining the hypothesized innate capability. The findings challenge the view that moral self-correction emerges from pretraining alone and highlight the need for better generator-evaluator alignment, including RL-based improvements and rational speech act-inspired feedback design. The work provides mechanistic insights into why self-correction frequently relies on shallow heuristics and proposes directions for future work in instruction-tuning and system design to foster genuine moral self-correction.

Abstract

Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.

Paper Structure

This paper contains 22 sections, 17 figures, 23 tables.

Figures (17)

  • Figure 1: Mistral-7B. Self-distinguishing experimental results for the three representative biases (physical, religion, and sexual orientation) in BBQ. The baseline (red) denotes results when we directly instruct LLMs to make a decision, representing the fundamental ability of LLMs to detect the generally stereotyped social group mentioned in the context. Additional experimental results are presented in Figure \ref{fig:distinguish-bbq-appendix}.
  • Figure 2: Mistral-7B. Self-distinguishing experimental results for the RealToxicity benchmark, across all self-correction methods used. The red solid line represents the ratio of samples for which the self-correction method successfully reduced toxicity in the final round compared to the first round. Additional results are in Appendix \ref{app:moreresults4deepseek}.
  • Figure 3: Mistral-7B. BBQ-Age. Two subfigures on the left: the activated warrants in feedback with extrinsic (ext) self-correction. We also examine the activated warrants when the feedback is removed from the input, shown by the red line (ext-W/O-feedback), and the activated warrants through the feedback alone. Two subfigures on the right: the activated warrants in CoT with CoT-enhanced intrinsic self-correction (int-CoT), and the control experiments in which CoT is removed from the inputs at each round. We discard the rounds for generating CoT. See more results for other BBQ bias types in Appendix \ref{app:mechanism_1}.
  • Figure 4: Mistral-7B. RealToxicity. Left: the activated warrant in feedback with extrinsic (ext) self-correction. We also examine the activated warrant when the feedback is removed from the input, shown by the red line (ext-W/O-feedback), and the activated warrant through the feedback alone. Feedback is used only from the $2^{nd}$ round onward. Right: the activated warrant in CoT with CoT-enhanced intrinsic self-correction (int-CoT), and the control experiments in which CoT is removed from the inputs at each round. We discard the rounds for generating CoT.
  • Figure 5: Mistral-7B. Mechanistic analysis of CoT-enhanced extrinsic self-correction (ext-CoT) for BBQ-Age. Left and middle: the activated warrants from CoT generated in settings with or without feedback. The blue dashed line represents the $1^{st}$-round CoT from the LLMs, serving as a reference point. Right: the IFD score for CoT and feedback when LLMs are instructed to generate a response. See more results for other BBQ bias types and other models in Appendix \ref{app:mechanism_2} and \ref{app:results4othermodels}, respectively.
  • ...and 12 more figures
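
The Figure 2 caption describes a ratio of samples whose toxicity was reduced in the final round compared to the first round. A minimal sketch of how such a ratio could be computed follows. It assumes that each sample has one numeric toxicity score per round and that "reduced" means a strictly lower final score. The function name `toxicity_reduction_ratio` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def toxicity_reduction_ratio(
    first_round: Sequence[float],
    final_round: Sequence[float],
) -> float:
    """Fraction of samples whose final-round toxicity score is strictly
    lower than their first-round score.

    Assumption: scores are paired by index (same sample, different rounds);
    the scorer that produced them is outside the scope of this sketch.
    """
    if len(first_round) != len(final_round):
        raise ValueError("both rounds must score the same number of samples")
    if len(first_round) == 0:
        raise ValueError("at least one sample is required")

    # Count samples where self-correction lowered the toxicity score.
    reduced = sum(
        1 for first, final in zip(first_round, final_round) if final < first
    )
    return reduced / len(first_round)


# Hypothetical scores for four samples (illustrative only).
first = [0.8, 0.5, 0.2, 0.4]
final = [0.3, 0.5, 0.1, 0.6]
ratio = toxicity_reduction_ratio(first, final)
```

Counting only strict decreases treats an unchanged score as a failure to self-correct; whether ties should count is a design choice the caption does not specify.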