How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel; Siva Reddy; Dzmitry Bahdanau

TL;DR

CHASE introduces a scalable, fully automated framework to craft challenging evaluation benchmarks for LLMs across three domains: CHASE-QA (long-context document QA), CHASE-Code (repository-level code completion), and CHASE-Math (grade-school math word problems). It relies on a bottom-up problem construction strategy and a decomposition into independently verifiable sub-tasks using a generator G and a verifier V to ensure data quality. Empirically, state-of-the-art models achieve only 40–60% accuracy on these synthetic benchmarks, highlighting the difficulty of long-context reasoning and the need for robust evaluation data beyond existing datasets. By providing benchmarks and code publicly, CHASE offers a scalable, renewable, and domain-diverse approach to differentiating model capabilities and guiding future improvements in evaluation methodologies.

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
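The bottom-up, verify-each-step idea described above can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names `build_problem`, `toy_generate`, and `toy_verify`, the retry limit, and the string representation of a problem component are all assumptions made for the example. In a real pipeline the generator and verifier would be LLM calls.

```python
from typing import Callable, List, Optional

# Hypothetical signatures: a generator proposes the next component given the
# components accepted so far; a verifier accepts or rejects that one component.
Generator = Callable[[List[str]], str]
Verifier = Callable[[List[str], str], bool]


def build_problem(generate: Generator, verify: Verifier,
                  num_steps: int, max_retries: int = 3) -> Optional[List[str]]:
    """Grow a problem one component at a time, checking each addition in isolation."""
    components: List[str] = []
    for _ in range(num_steps):
        for _ in range(max_retries):
            candidate = generate(components)
            if verify(components, candidate):
                components.append(candidate)
                break
        else:
            # No candidate passed verification within the retry budget,
            # so the whole problem is discarded (an assumed policy).
            return None
    return components


# Toy stand-ins for the LLM calls (illustrative only).
def toy_generate(so_far: List[str]) -> str:
    return f"step {len(so_far) + 1}"


def toy_verify(so_far: List[str], candidate: str) -> bool:
    return candidate not in so_far


problem = build_problem(toy_generate, toy_verify, num_steps=3)
```

Because each component is checked on its own before the next one is added, an error is caught at the step where it appears rather than after the full problem has been assembled.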
