Narrowing the Complexity Gap in the Evaluation of Large Language Models

Yang Chen; Shuyang Liu; Reyhaneh Jabbarvand

TL;DR

GeneBench is a fully automated, task-agnostic framework that uses a multi-objective genetic algorithm to transform existing programming benchmarks into more complex, real-world-like problems while preserving readability. Evaluating 13 LLMs across four code tasks on the transformed benchmarks shows a substantial performance drop (about 35% on average), indicating that current LLMs struggle with real-world complexity beyond standard benchmarks. The approach addresses data contamination, overfitting, and benchmark saturation by generating diverse, semantically equivalent but more complex problems, and its results align with real-world repair benchmarks such as SWE-Bench. These findings suggest that GeneBench provides a practical, scalable proxy for real-world evaluation without costly mining, with the potential to guide future LLM training and evaluation strategies.

Abstract

Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using them in real-world settings. Recently, researchers have explored the construction of more realistic benchmarks by mining or augmenting open-source repositories. Such solutions are usually task-specific, and quality control of data from real-world projects can be time-consuming and error-prone. More importantly, evaluating LLMs on fixed benchmark problems is subject to data contamination and overfitting. We propose GeneBench, an automated technique to add real-world complexities to any programming benchmark. GeneBench leverages multi-objective optimization to increase the complexity of programming problems while keeping the code as readable as real-world programs. Transforming four widely-used programming benchmarks using GeneBench and evaluating 13 LLMs (including two reasoning LLMs) on them shows a notable performance drop across all programming tasks (14.9%-60.5%, avg=35.2%), demonstrating LLMs' struggle under real-world complexities. The struggle persists even when LLMs are few-shot prompted or fine-tuned with examples from different versions of GeneBench, demonstrating the challenging nature of the problems. Finally, we show that the performance of the studied LLMs in bug repair is similar under GeneBench and SWE-Bench. This, along with the consistent reproduction of the performance drop for all studied LLMs across four tasks under different versions of GeneBench, makes the technique suitable for evaluating LLMs without the costly construction of real-world benchmarks.
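
The abstract describes a multi-objective optimization that trades off complexity against readability, and the figure list mentions selection from a Pareto front. As a rough illustration of that general idea only, the sketch below filters a set of candidate transformations down to the non-dominated ones. The `Candidate` type, the candidate names, and the two numeric scores are hypothetical and are not taken from the paper; the actual fitness metrics and selection procedure of GeneBench are not specified in this section.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical transformed program with two made-up objective scores."""
    name: str
    complexity: float   # assumed: higher means more complex
    readability: float  # assumed: higher means more readable


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.complexity >= b.complexity and a.readability >= b.readability
    strictly_better = a.complexity > b.complexity or a.readability > b.readability
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative values only; they do not come from any experiment.
candidates = [
    Candidate("original", 1.0, 0.9),
    Candidate("variant_a", 3.0, 0.8),
    Candidate("variant_b", 5.0, 0.4),
    Candidate("variant_c", 2.5, 0.7),  # dominated by variant_a
    Candidate("variant_d", 4.0, 0.3),  # dominated by variant_b
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy setup, a candidate survives only if no alternative is at least as complex and at least as readable while being strictly better on one of the two, which is the standard notion of Pareto optimality for two maximized objectives.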

Paper Structure (27 sections, 2 equations, 15 figures, 6 tables, 2 algorithms)

Introduction
Problem Statement and Challenges
GeneBench
Genetic Algorithm
Fitness Evaluation
Evolution
Chromosome Selection
Chromosome Manipulation
Transformation Operators
Evaluation
Experiment Setup
Subject LLMs
Tasks and Subject Benchmarks
RQ1: Properties of Transformations
RQ2: Effectiveness and Analysis of Failure
...and 12 more sections

Figures (15)

Figure 1: Code complexity of existing benchmarks
Figure 2: GeneBench transformation of CRUXEval-409
Figure 3: Real-world code example from Astropy
Figure 4: Genetic makeup in GeneBench
Figure 5: Selection from Pareto Front
...and 10 more figures
