Table of Contents
Fetching ...

Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

Ken Tsui

TL;DR

This work reveals a systemic Self-Correction Blind Spot in LLMs: models reliably correct external errors but struggle to correct their own outputs. By introducing Self-Correction Bench with controlled error injection across three task complexities, the study quantifies a mean blind-spot rate of $64.5\%$ on 14 open-source models. A minimal test-time prompt, such as a single "Wait" token, activates dormant self-correction pathways and reduces blind spots by $89.3\%$, highlighting a practical route to improve reliability without fine-tuning. The authors connect this behavior to training-data biases and the density of correction markers in post-training data, and they show reasoning models differ in their self-correction dynamics, suggesting paths to bridge capabilities via prompt design and data curation.

Abstract

Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources - a limitation we term the Self-Correction Blind Spot. To study this phenomenon, we introduce Self-Correction Bench, an evaluation framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas reinforcement learning (RL) trained models learn error correction via outcome feedback. Remarkably, appending a minimal "Wait" prompt activates a 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhance LLM reliability and trustworthiness - vital for safety-critical domains.

Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

TL;DR

This work reveals a systemic Self-Correction Blind Spot in LLMs: models reliably correct external errors but struggle to correct their own outputs. By introducing Self-Correction Bench with controlled error injection across three task complexities, the study quantifies a mean blind-spot rate of on 14 open-source models. A minimal test-time prompt, such as a single "Wait" token, activates dormant self-correction pathways and reduces blind spots by , highlighting a practical route to improve reliability without fine-tuning. The authors connect this behavior to training-data biases and the density of correction markers in post-training data, and they show reasoning models differ in their self-correction dynamics, suggesting paths to bridge capabilities via prompt design and data curation.

Abstract

Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources - a limitation we term the Self-Correction Blind Spot. To study this phenomenon, we introduce Self-Correction Bench, an evaluation framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas reinforcement learning (RL) trained models learn error correction via outcome feedback. Remarkably, appending a minimal "Wait" prompt activates a 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhance LLM reliability and trustworthiness - vital for safety-critical domains.

Paper Structure

This paper contains 27 sections, 2 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Example of error injection. Grey color shows model completion. Above: Error injection in model; Below: Error injection in user message
  • Figure 2: Self-Correction Blind Spot and 95% confidence interval across models
  • Figure 3: left: Blind spot correlation matrix middle: Scatter plot between SCLI5 vs GSM8K-SC right: Scatter plot between GSM8K-SC vs PRM800K-SC BCA: Before commit an answer
  • Figure 4: Macro average accuracy by non-reasoning model increases from original to appended "Wait"
  • Figure 5: left: Mean accuracy correlation matrix across datasets middle: Scatter plot between SCLI5 vs GSM8K-SC right: Scatter plot between GSM8K-SC vs PRM800K-SC BCA: Before commit an answer
  • ...and 11 more figures