Self-Evolving Critique Abilities in Large Language Models

Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin

TL;DR

SCRIT addresses scalable oversight by enabling LLMs to self-evolve their critique abilities through a contrastive data synthesis pipeline and self-validation, removing reliance on external supervisors. The method generates critique data using reference solutions, validates the resulting corrections, and self-trains on the data that passes; the trained model needs no reference solutions at inference. Empirical results show consistent gains in critique-correction accuracy and error identification across math and science reasoning tasks, with benefits that scale with data and model size and generalize across domains. The work also examines the importance of self-validation, the role of domain diversity, and the advantage of contrastive critique over direct critique baselines, pointing to a practical path toward continuous self-improvement in LLMs. Future directions include applying SCRIT’s critiques to reinforcement learning loops and extending the framework to other structured reasoning domains.

Abstract

Despite their remarkable performance, Large Language Models (LLMs) face a critical challenge: providing feedback for tasks where human evaluation is difficult or where LLMs potentially outperform humans. In such scenarios, leveraging the critique ability of LLMs themselves (identifying and correcting flaws) shows considerable promise. This paper explores enhancing critique abilities of LLMs, noting that current approaches rely on human annotations or more powerful models, leaving the challenge of improving critique abilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that trains LLMs with self-generated data to evolve their critique abilities. To address the low quality of naively generated data, we propose a contrastive-critic approach that uses reference solutions during data synthesis to enhance the model's understanding of key concepts, and incorporates a self-validation scheme to ensure data quality. The final trained model operates without any reference solutions at inference time. Implemented with Qwen2.5-72B-Instruct, a leading LLM, SCRIT demonstrates consistent improvements across a wide range of benchmarks spanning both mathematical and scientific reasoning: achieving a 10.0% relative gain in critique-correction accuracy and a 19.0% relative improvement in error identification F1-score. Our analysis reveals that SCRIT's performance scales positively with data and model size and enables continuous improvement through multi-round iterations.
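
The data synthesis loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` and `Critique` types, the `contrastive_critic` callable (standing in for a call to the model being trained), and the string-equality `answers_match` check used for self-validation are all hypothetical names introduced here. The one detail taken directly from the abstract is that reference solutions are used only during synthesis, so the stored training prompt omits them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    problem: str
    student_solution: str    # solution to critique (may contain errors)
    reference_solution: str  # used only during data synthesis
    reference_answer: str    # final answer of the reference solution


@dataclass
class Critique:
    text: str              # step-wise critique produced by the model
    corrected_answer: str  # final answer reached after the proposed correction


def default_answers_match(a: str, b: str) -> bool:
    # Assumption: exact match after trimming whitespace; a real system
    # would likely need a more tolerant comparison.
    return a.strip() == b.strip()


def synthesize_training_data(
    examples: List[Example],
    contrastive_critic: Callable[[Example], Critique],
    answers_match: Callable[[str, str], bool] = default_answers_match,
) -> List[Dict[str, str]]:
    """Generate critiques with access to references, keep validated ones."""
    kept = []
    for ex in examples:
        # The critic sees the reference solution while generating.
        critique = contrastive_critic(ex)
        # Self-validation (assumed form): keep the critique only if its
        # correction reaches the reference answer.
        if not answers_match(critique.corrected_answer, ex.reference_answer):
            continue
        # The training prompt omits the reference solution, since the
        # trained model critiques without references at inference.
        prompt = f"Problem:\n{ex.problem}\n\nSolution:\n{ex.student_solution}"
        kept.append({"prompt": prompt, "target": critique.text})
    return kept
```

The filtered pairs would then be used to fine-tune the same model, and the loop can be repeated with the updated model as the critic, which corresponds to the multi-round iteration mentioned above.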

Paper Structure (36 sections, 1 equation, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Direct critic (baseline) vs. contrastive critic (ours). Left panel: input materials prepared for critique generation. Right panel: outputs from both approaches. The direct critic exhibits "rubber-stamping" behavior, incorrectly validating flawed solutions and providing misleading feedback. The contrastive critic, however, utilizes reference solutions to grasp key concepts and strategies, enabling accurate error identification and correction.
  • Figure 2: Data statistics before and after self-critic and self-validation filtering.
  • Figure 3: Scaling and multi-round performance analysis. Left panel: Data size scaling of Contrastive Critic, Direct Critic, and Bug-Injection Critic. Middle panel: Model size scaling from 1.5B to 72B parameters. Right panel: Multi-round self-evolving over 3 iterations.
  • Figure 4: System prompts used for different critic mechanisms. Top Left: Direct Critic analyzes solution correctness directly, without any additional context. Bottom Left: Bug-Injection Critic first injects bugs (Step 1), then applies the direct critic to the bug-injected solution (Step 2). Right: Contrastive Critic first analyzes a reference solution to understand key mathematical concepts before conducting step-wise critique.
  • Figure 5: Comparison between Direct Critic and Contrastive Critic. Direct Critic blindly approves the student solution, failing to identify any errors and giving a misleading verdict. In contrast, Contrastive Critic first analyzes the reference solution to understand key mathematical concepts, enabling it to precisely locate the error in the student solution. By developing an understanding of the underlying mathematical concepts, Contrastive Critic generates an effective critique that guides the correction process to the correct final answer.
  • ...and 12 more figures