On the Convergence of Moral Self-Correction in Large Language Models

Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson

TL;DR

The paper investigates why moral self-correction in large language models converges under multi-round prompting. Using a mechanistic framing that centers on latent concepts (positive/negative moral signals) and model uncertainty, it demonstrates converged performance across six tasks and multiple models, with QA tasks typically settling after the first round and generation tasks requiring more rounds. It provides empirical evidence that instructions activate morality-related latent concepts, which in turn reduce semantic uncertainty and calibration errors, guiding outputs toward stability. A targeted simulation shows a strong link between concept activation and uncertainty reduction, supporting a causal interpretation of the convergence mechanism. The work suggests that intrinsic self-correction is a practical pathway to robust alignment, driven by internal representations rather than external feedback.

Abstract

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

Paper Structure

This paper contains 17 sections, 1 equation, 9 figures.

Figures (9)

  • Figure 1: Applying multi-round intrinsic self-correction to the task of text detoxification in a conversation scenario. By injecting self-correction instructions (bold font) into queries (green text boxes) for several rounds, the toxicity level of the generated sentences (blue text boxes) declines and ultimately converges. Our experiments show this convergence can be achieved, on average, within 6 rounds of self-correction. We investigate how the latent concept and model uncertainty drive LLMs towards convergence, thus achieving stable performance on downstream tasks, e.g., decreasing toxicity. By injecting instructions during multi-round self-correction, positive/moral concepts are activated and model uncertainty is reduced.
  • Figure 2: The logical framework of our analysis considers two key variables: latent concept and model uncertainty. A positive (moral) concept implies that the activated concept aligns with the self-correction objective, such as fairness or non-toxicity. We hypothesize that the injected self-correction instruction can activate the desired concept, which in turn reduces model uncertainty. This reduction ultimately leads to converged self-correction performance.
  • Figure 3: The self-correction performance for six different tasks, including both language generation tasks and multi-choice tasks. The x-axis represents the self-correction round, and the y-axis indicates the performance evaluated on the corresponding task. Self-correction performance improves as the interaction rounds progress and eventually converges. For the social bias mitigation task and the jailbreak defense task, performance reaches its best value in the first round and stays at that value for the remaining interaction rounds.
  • Figure 4: The evolution of activated concepts for (a) QA tasks and (b) generation tasks. For the generation task, we also run experiments that inject immoral instructions in all rounds and in only some rounds.
  • Figure 5: The reported model uncertainty for the language generation and QA tasks, through the lens of self-correction rounds. For QA tasks, we show results for four social bias dimensions, i.e., Physical, Sexual, Religion, and Disability. The uncertainty converged after 10 rounds; we show 20 rounds to indicate its convergence.
  • ...and 4 more figures
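
The multi-round procedure described in the Figure 1 caption (append a self-correction instruction, generate, score, repeat until the score stops changing) can be sketched as a short loop. This is a minimal illustration and not the authors' implementation: the function name `self_correct`, the `tol` and `patience` stopping rule, the instruction wording, and the toy `toy_generate` and `toy_score` stand-ins are all assumptions made for the example. A real setup would replace them with an LLM call and a task metric such as a toxicity classifier.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction text; the paper's exact wording is not given here.
INSTRUCTION = "Please rewrite your previous response to be less toxic."


def self_correct(
    generate: Callable[[str], str],
    score: Callable[[str], float],
    query: str,
    instruction: str = INSTRUCTION,
    max_rounds: int = 10,
    tol: float = 0.01,
    patience: int = 2,
) -> Tuple[List[float], int]:
    """Run multi-round self-correction.

    Returns the per-round scores and the round at which the score had
    changed by less than `tol` for `patience` consecutive rounds
    (or -1 if that never happened within `max_rounds`).
    """
    history = query
    scores: List[float] = []
    stable = 0
    for round_idx in range(1, max_rounds + 1):
        # Inject the same instruction at every round, as in Figure 1.
        prompt = f"{history}\n{instruction}"
        response = generate(prompt)
        scores.append(score(response))
        # Keep the whole conversation so later rounds see earlier ones.
        history = f"{prompt}\n{response}"
        if len(scores) >= 2 and abs(scores[-1] - scores[-2]) < tol:
            stable += 1
            if stable >= patience:
                return scores, round_idx
        else:
            stable = 0
    return scores, -1


def toy_generate(prompt: str) -> str:
    """Toy stand-in for an LLM: emits one fewer 'bad' token per round."""
    rounds_so_far = prompt.count(INSTRUCTION)
    n_bad = max(0, 3 - rounds_so_far)
    return " ".join(["bad"] * n_bad + ["ok"] * (3 - n_bad))


def toy_score(response: str) -> float:
    """Toy stand-in for a toxicity metric: fraction of 'bad' tokens."""
    tokens = response.split()
    return tokens.count("bad") / len(tokens)


scores, converged_at = self_correct(toy_generate, toy_score, "Describe your neighbor.")
print(scores, converged_at)
```

The stopping rule here is only one possible operational definition of convergence; the source reports convergence from the observed performance curves rather than from a fixed threshold.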