RefineBench: Evaluating Refinement Capability of Language Models via Checklists

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi

TL;DR

RefineBench addresses language-model self-refinement with a benchmark of 1,000 problems across 11 domains that tests both self-refinement and guided refinement over multi-turn interactions, scored against a unified checklist. The authors show that without explicit feedback, frontier LMs struggle to improve meaningfully over five turns, while guided refinement lifts proprietary LMs and large open-weight LMs (>70B) to near-perfect levels, exposing a gap in self-refinement capability. The dataset construction combines problem sourcing, careful checklist creation, and human validation, and the evaluation framework includes extrinsic and intrinsic tasks with verifiable and non-verifiable items. Overall, RefineBench provides a cost-aware testbed for tracking progress in refinement capabilities and points to research directions for enabling robust self-refinement in future LMs.

Abstract

Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
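
To make the two refinement modes concrete, here is a minimal Python sketch of a checklist-scored refinement loop. It is an illustration under stated assumptions, not the authors' implementation: the names `refine`, `failed_items`, `toy_judge`, and `toy_model` are hypothetical, the score is assumed to be the fraction of checklist items satisfied, and guided feedback is assumed to be the list of unmet checklist items. The paper's actual prompts, judge model, and scoring rule are not given in this section.

```python
from typing import Callable, List

# Hypothetical interfaces, introduced only for this sketch.
Judge = Callable[[str, str], bool]       # (response, checklist_item) -> satisfied?
Model = Callable[[str, List[str]], str]  # (previous_response, feedback) -> new response


def failed_items(response: str, checklist: List[str], judge: Judge) -> List[str]:
    """Return the checklist items that the response does not satisfy."""
    return [item for item in checklist if not judge(response, item)]


def refine(model: Model, judge: Judge, first_response: str,
           checklist: List[str], guided: bool, max_turns: int = 5) -> List[float]:
    """Evaluate up to `max_turns` responses and return the pass rate per turn.

    Assumption: pass rate = fraction of checklist items satisfied.
    In guided mode the model sees the unmet items as feedback;
    in self-refinement mode it receives no feedback at all.
    """
    response = first_response
    pass_rates: List[float] = []
    for turn in range(max_turns):
        failed = failed_items(response, checklist, judge)
        pass_rates.append(1 - len(failed) / len(checklist))
        if not failed or turn == max_turns - 1:
            break  # nothing left to fix, or turn budget exhausted
        feedback = failed if guided else []
        response = model(response, feedback)
    return pass_rates


# Toy stand-ins: the judge does substring matching, and the "model" simply
# appends whatever feedback it is given. Real judges and models are LMs.
def toy_judge(response: str, item: str) -> bool:
    return item in response


def toy_model(previous: str, feedback: List[str]) -> str:
    return previous + " " + " ".join(feedback)


checklist = ["units", "assumption", "final answer"]
guided_rates = refine(toy_model, toy_judge, "final answer", checklist, guided=True)
self_rates = refine(toy_model, toy_judge, "final answer", checklist, guided=False)
```

With these toy stand-ins the guided run stops once every item is satisfied, while the self-refinement run uses all five turns without changing its score. This only shows how the loop is wired; it does not reproduce any result reported in the paper.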

Paper Structure

This paper contains 69 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: (Left) Strong LMs such as Claude-Sonnet-4 can self-refine effectively on AIME-24, where they already solve problems reasonably well in the first iteration. However, on saturated benchmarks such as MATH-500, there is little headroom for improvement, and on our proposed benchmark, RefineBench, performance gains remain limited. Hence, RefineBench serves as a testbed for measuring self-refinement capability of frontier LMs. (Right) The biggest bottleneck when an LM (Gemini-2.5-Pro) refines its output is that it often struggles to identify which aspects need to be corrected. In RefineBench, beyond the self-refinement setting where the LM must independently identify and fix errors, we also introduce settings where partial hints are provided about what needs to be revised, or where the amount of feedback varies. This enables a systematic analysis of refinement capability.
  • Figure 1: Basic Statistics.
  • Figure 2: An example from RefineBench (left) and an overview of the two evaluation protocols (i.e., self-refinement, guided refinement) in RefineBench (right).
  • Figure 3: Distribution of domain categories.
  • Figure 4: Partial guided refinement performance with the provided feedback ratio 50%.
  • ...and 9 more figures