Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models
Ken Tsui
TL;DR
This work reveals a systematic Self-Correction Blind Spot in LLMs: models reliably correct errors that come from external sources but struggle to correct identical errors in their own outputs. By introducing Self-Correction Bench, which injects controlled errors at three levels of task complexity, the study measures a mean blind spot rate of 64.5% across 14 open-source non-reasoning models. A minimal test-time prompt, such as a single appended "Wait" token, activates dormant self-correction behavior and reduces blind spots by 89.3%, which points to a practical way to improve reliability without fine-tuning. The authors connect this behavior to training-data biases and to the density of correction markers in post-training data. They also show that reasoning models differ in their self-correction dynamics, suggesting that prompt design and data curation could narrow the gap.
Abstract
Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs frequently fail to correct errors in their own outputs while successfully correcting identical errors from external sources, a limitation we term the Self-Correction Blind Spot. To study it, we introduce Self-Correction Bench, an evaluation framework that measures the blind spot through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average blind spot rate of 64.5%. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas models trained with reinforcement learning (RL) learn error correction via outcome feedback. Remarkably, appending a minimal "Wait" prompt yields an 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhance LLM reliability and trustworthiness, which is vital for safety-critical domains.
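As a rough illustration of the quantities quoted in the abstract, the sketch below computes a blind spot rate and a relative reduction from paired correction outcomes. The abstract does not spell out the formula, so the definition used here (the relative drop in correction accuracy when the same error appears in the model's own output instead of in external input) is an assumption, as are all names (`Trial`, `blind_spot_rate`, `relative_reduction`, `append_wait`).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Trial:
    """One injected error, evaluated in two framings (hypothetical structure)."""
    corrected_external: bool  # error presented as coming from an external source
    corrected_internal: bool  # identical error placed in the model's own output


def _accuracy(flags: Iterable[bool]) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def blind_spot_rate(trials: List[Trial]) -> float:
    """ASSUMED definition: relative drop from external to internal correction accuracy."""
    external = _accuracy(t.corrected_external for t in trials)
    internal = _accuracy(t.corrected_internal for t in trials)
    if external == 0.0:
        return 0.0  # no external corrections, so no gap can be measured
    return (external - internal) / external


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate shrinks, e.g. after appending a "Wait" prompt."""
    return (before - after) / before if before else 0.0


def append_wait(partial_response: str, marker: str = "Wait") -> str:
    """Append a minimal correction marker to the model's partial output."""
    return partial_response.rstrip() + " " + marker


if __name__ == "__main__":
    trials = [
        Trial(True, True),
        Trial(True, False),
        Trial(True, False),
        Trial(True, False),
    ]
    print(blind_spot_rate(trials))
    print(append_wait("So the total is 12. "))
```

Under this assumed definition, a rate of 0 means the model corrects its own errors as often as external ones, and a rate of 1 means it never corrects its own errors despite correcting external ones.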
