How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel, Siva Reddy, Dzmitry Bahdanau

TL;DR

CHASE introduces a scalable, fully automated framework to craft challenging evaluation benchmarks for LLMs across three domains: CHASE-QA (long-context document QA), CHASE-Code (repository-level code completion), and CHASE-Math (grade-school math word problems). It relies on a bottom-up problem construction strategy and a decomposition into independently verifiable sub-tasks using a generator G and a verifier V to ensure data quality. Empirically, state-of-the-art models achieve only 40–60% accuracy on these synthetic benchmarks, highlighting the difficulty of long-context reasoning and the need for robust evaluation data beyond existing datasets. By providing benchmarks and code publicly, CHASE offers a scalable, renewable, and domain-diverse approach to differentiating model capabilities and guiding future improvements in evaluation methodologies.

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
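The bottom-up, verify-each-step idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `build_problem`, its parameters, and the toy generator and verifier are hypothetical names introduced here only for illustration. In practice both roles would be played by LLM calls; simple stand-in functions are used so the example runs on its own.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper):
# a generator proposes the next component given those accepted so far,
# and a verifier accepts or rejects that single proposal.
Generator = Callable[[List[str]], str]
Verifier = Callable[[List[str], str], bool]


def build_problem(
    generate: Generator,
    verify: Verifier,
    num_steps: int,
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Assemble a problem from components, verifying each one independently."""
    components: List[str] = []
    for _ in range(num_steps):
        for _ in range(max_attempts):
            candidate = generate(components)
            if verify(components, candidate):
                components.append(candidate)
                break
        else:
            # No candidate passed verification; discard the whole problem.
            return None
    return components


# Toy stand-ins for the LLM generator and verifier (illustrative only).
def toy_generate(components: List[str]) -> str:
    return f"step {len(components) + 1}"


def toy_verify(components: List[str], candidate: str) -> bool:
    return bool(candidate) and candidate not in components


if __name__ == "__main__":
    print(build_problem(toy_generate, toy_verify, num_steps=3))
```

Because each component is checked in isolation before the next is added, a single failed check rejects the problem early instead of surfacing as an error in the finished example.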

Paper Structure

This paper contains 65 sections, 34 figures, 12 tables.

Figures (34)

  • Figure 1: Top: Illustrating the high-level ideas behind our proposed CHASE framework. Bottom left: Pipeline for creating an example in CHASE-QA. Bottom right: Pipeline for creating a math word problem in CHASE-Math. The pipeline for CHASE-Code is illustrated in Figure \ref{fig:chase_code} in the Appendix.
  • Figure 1: The performance of various LLMs on all $3$ domains of the CHASE benchmark. We measure the accuracy of the predictions for CHASE-QA and CHASE-Math, and pass@1 for CHASE-Code. Data and Algo refer to the data pre-processing and algorithms sub-domains of CHASE-Code. Numbers in bold indicate the best performance on each domain, while underlined numbers indicate best-in-class performance.
  • Figure 2: Examples of problems from all three benchmarks created using CHASE.
  • Figure 2: Performance of LLMs on data generated by direct prompting approaches without using CHASE.
  • Figure 3: Performance of LLMs decreases uniformly with increasing context sizes for the 100-example subset of CHASE-QA (top) and the 55-example subset of CHASE-Code (bottom).
  • ...and 29 more figures