On the Convergence of Moral Self-Correction in Large Language Models
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson
TL;DR
The paper investigates why the performance of moral self-correction in large language models converges under multi-round prompting. Using a mechanistic framing centered on latent concepts (positive and negative moral signals) and model uncertainty, it shows that performance converges across six tasks and multiple models: QA tasks typically settle after the first round, while generation tasks require more rounds. It provides empirical evidence that self-correction instructions activate morality-related latent concepts, which in turn reduce semantic uncertainty and calibration error and steer outputs toward stability. A targeted simulation shows a strong link between concept activation and uncertainty reduction, which supports a causal interpretation of the convergence mechanism. The work suggests that intrinsic self-correction, driven by internal representations rather than external feedback, is a practical pathway to robust alignment.
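As a rough illustration of the calibration error mentioned above, the following minimal sketch computes the standard expected calibration error (ECE) with equal-width confidence bins. This is an assumption about the kind of metric involved, not the authors' implementation: the summary does not specify which calibration measure or binning scheme the paper uses, and the function name, the `n_bins` parameter, and the per-round numbers are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: the sample-weighted average, over equal-width confidence
    bins, of |accuracy - mean confidence| within each bin.

    confidences: list of floats in [0, 1], the model's confidence per answer.
    correct:     list of 0/1 (or bool) flags, whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Assign each sample to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical, made-up per-round data (NOT results from the paper):
# the same four answers scored after two self-correction rounds.
round_1 = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
round_2 = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(f"ECE round 1: {round_1:.3f}, round 2: {round_2:.3f}")
```

Under this reading, a lower ECE in later rounds would indicate that the model's stated confidence tracks its accuracy more closely, which is the direction of change the TL;DR attributes to activated moral concepts.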
Abstract
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
