Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models

Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li

TL;DR

The paper tackles data contamination in the code benchmarks used to evaluate how well large language models understand code. It proposes dynamic benchmarking: each benchmark program $p_i$ is rewritten by semantics-preserving mutations into a syntactically distinct program $p_i'$ that produces the same outputs, i.e., for all inputs $x$, $[[p_i]]_x = [[p_i']]_x$. The approach introduces two variable-level mutations, VarNormI and VarNormII, and code structure mutations such as Constant Unfolding, For2While, and Condition Augmentation. It is evaluated on CRUXEval, Avatar, CodeNet, and TransCoder, where it reports substantial performance drops and some ranking shifts, with BLEU-based similarity scores indicating that the perturbation is controlled. The results suggest that dynamic benchmarks resist data leakage and differentiate models more fairly, offering a practical path toward robust evaluation of code reasoning.

Abstract

In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., each program, with various semantics-preserving mutations to build a benchmark that is syntactically new yet semantically identical. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks can resist the data contamination problem.
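To make the idea of a semantics-preserving mutation concrete, here is a minimal sketch in the spirit of the Constant Unfolding mutation named in the TL;DR. It is not the authors' implementation: the class `ConstantUnfolder`, the helpers `mutate` and `same_outputs`, and the choice to rewrite every integer literal `c` as `(c - 1) + 1` are assumptions made for illustration. The sketch also ignores constructs where a literal cannot be replaced by an expression, such as `match` patterns.

```python
import ast


class ConstantUnfolder(ast.NodeTransformer):
    """Hypothetical mutation: rewrite each integer literal c as (c - 1) + 1."""

    def visit_Constant(self, node):
        # bool is a subclass of int, so compare the exact type to skip True/False.
        if type(node.value) is int:
            unfolded = ast.BinOp(
                left=ast.Constant(node.value - 1),
                op=ast.Add(),
                right=ast.Constant(1),
            )
            return ast.copy_location(unfolded, node)
        return node


def mutate(source: str) -> str:
    """Return a syntactically different program with the same behavior."""
    tree = ConstantUnfolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def same_outputs(original: str, mutated: str, func_name: str, inputs) -> bool:
    """Spot-check equivalence on sample inputs (a test, not a proof)."""
    env_a, env_b = {}, {}
    exec(original, env_a)
    exec(mutated, env_b)
    return all(env_a[func_name](x) == env_b[func_name](x) for x in inputs)


ORIGINAL = """
def f(x):
    total = 0
    for i in range(3):
        total += x * 2 + i
    return total
"""

MUTATED = mutate(ORIGINAL)

if __name__ == "__main__":
    print(MUTATED)
    print(same_outputs(ORIGINAL, MUTATED, "f", range(-5, 6)))
```

Comparing outputs on a handful of inputs only samples the equivalence condition; a transformation like this one is meant to preserve semantics by construction, and the check serves as a sanity test.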

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Answers produced by GPT-4o mini and DeepSeek V3. The modified variables are highlighted in blue for identification.
  • Figure 2: The dynamic benchmarking framework.
  • Figure 3: Illustrative examples of our code syntax and code structure mutations. (a) is the code from the original benchmark, while (b)-(f) are new programs produced by applying one of our mutations.
  • Figure 4: Comparison of model performance distributions under different mutations of the CodeNet benchmark.
  • Figure 5: Comparison of Pass@1 scores on static and dynamic CRUXEval, with and without (w/o) fine-tuning the model.
  • ...and 2 more figures