Self-correction is Not An Innate Capability in Large Language Models
Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, Kristen Marie Johnson
TL;DR
The paper investigates whether moral self-correction is an innate capability of large language models (LLMs) by combining behavioral tests of moral sensitivity with mechanistic analyses of hidden states, using BBQ and RealToxicity benchmarks. It finds that LLMs are not morally sensitive and do not consistently leverage external feedback during self-correction; while both Chain-of-Thought (CoT) and external feedback can aid performance, their interaction often yields conflicts, undermining the hypothesized innate capability. The findings challenge the view that moral self-correction emerges from pretraining alone and highlight the need for better generator-evaluator alignment, including RL-based improvements and rational speech act-inspired feedback design. The work provides mechanistic insights into why self-correction frequently relies on shallow heuristics and proposes directions for future work in instruction-tuning and system design to foster genuine moral self-correction.
Abstract
Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
