On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Jiliang Tang, Kristen Johnson

TL;DR

This work investigates intrinsic self-correction in LLMs, focusing on whether iterative self-instruction can guarantee convergence and why it occurs. Using multi-round QA and morality-centric tasks, the authors provide empirical evidence that self-correction improves performance and converges, with distinct convergence rates for different task types. They link convergence to a reduction in model uncertainty and to the activation of latent concepts (moral orientations) via probing analyses, supplemented by a simple theoretical framework. The findings suggest that consistent injected instructions reduce uncertainty, leading to calibrated predictions and stable improvements, with practical implications for designing robust self-correction strategies.

Abstract

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only the task's goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. In this paper, we unveil that intrinsic self-correction can be progressively improved, allowing it to approach a converged state. Our findings are verified in: (1) the scenario of multi-round question answering, by comprehensively demonstrating that intrinsic self-correction can progressively introduce performance gains through iterative interactions, ultimately converging to stable performance; and (2) the context of intrinsic self-correction for enhanced morality, in which we provide empirical evidence that iteratively applying instructions reduces model uncertainty towards convergence, which then leads to convergence of both the calibration error and self-correction performance, ultimately resulting in a stable state of intrinsic self-correction. Furthermore, we introduce a mathematical formulation and a simulation task indicating that the latent concepts activated by self-correction instructions drive the reduction of model uncertainty. Based on our experimental results and analysis of the convergence of intrinsic self-correction, we reveal its underlying mechanism: consistent injected instructions reduce model uncertainty, which yields converged, improved performance.

Paper Structure (22 sections, 3 equations, 6 figures)


Figures (6)

  • Figure 1: Applying multi-round intrinsic self-correction to the task of text detoxification in a question-answering scenario. Injecting self-correction instructions (bold font) into queries (green text boxes) over several rounds causes the toxicity level of the generated sentences (blue text boxes) to decline and ultimately approach convergence. Our experiments show this convergence can be achieved, on average, within 6 rounds of self-correction. We investigate how the latent concept and model uncertainty drive LLMs towards convergence, thus achieving stable performance on downstream tasks, e.g., decreasing toxicity. Injecting instructions during multi-round self-correction activates concepts and reduces model uncertainty.
  • Figure 2: The logical framework of our analysis considers two key variables: the concept and model uncertainty. A positive concept implies that the activated concept aligns with the self-correction objective, such as fairness or non-toxicity. We hypothesize that the injected self-correction instruction can activate the desired concept, which in turn reduces model uncertainty. This reduction in model uncertainty is expected to decrease and stabilize the calibration error, ultimately leading to converged self-correction performance.
  • Figure 3: The self-correction performance for six different tasks, including both language generation tasks and multi-choice tasks. The x-axis represents the self-correction round and the y-axis indicates the performance evaluated on the corresponding task. Self-correction performance improves as the interaction rounds progress and eventually converges. The social bias mitigation task and the jailbreak defense task reach their best performance in the first round and maintain it, unchanged, for the remaining interaction rounds.
  • Figure 4: The reported model uncertainty and calibration error for the language generation and QA tasks across self-correction rounds. For QA tasks, we show results for four social bias dimensions: Physical, Sexual, Religion, and Disability. Since the ECE converged in the first self-correction round, we add the baseline uncertainty and ECE values for reference; the self-correction process itself starts from the first round. The uncertainty converged after 10 rounds; we show 20 rounds to indicate its convergence.
  • Figure 5: The evolution of activated concepts for (a) QA tasks and (b) generation tasks. For the generation task, we also implement intervention experiments by injecting immoral instructions in some or all rounds.
  • ...and 1 more figures
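
The captions for Figures 2 and 4 refer to model uncertainty and calibration error (ECE) tracked across self-correction rounds. As a rough illustration of what those two quantities measure, the sketch below uses the textbook definitions: Shannon entropy of a predictive distribution and equal-width-bin expected calibration error. The estimators the paper actually uses are not given in this excerpt, so the function names, the 10-bin default, and the toy numbers are all assumptions made for illustration only.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a single predictive distribution.

    Assumption: entropy is used here as a generic stand-in for
    "model uncertainty"; the paper may use a different estimator.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence|.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct:     booleans saying whether each chosen answer was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical answer distributions for one question over three rounds
    # (made-up numbers, not results from the paper).
    rounds = [[0.4, 0.3, 0.3], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]]
    for i, dist in enumerate(rounds, start=1):
        print(f"round {i}: entropy = {predictive_entropy(dist):.3f} nats")

    # Hypothetical confidences and correctness flags for four answers.
    print("ECE =", expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                              [True, True, True, False]))
```

Under these definitions, a distribution that sharpens round over round yields lower entropy, and an ECE that stops changing between rounds is one way to read "converged calibration error" in the captions above.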