Addressing Data Leakage in HumanEval Using Combinatorial Test Design

Jeremy S. Bradbury, Riddhi More

TL;DR

This work tackles data leakage in LLM benchmarks for software engineering by proposing a benchmark construction method based on template tasks and combinatorial test design to instantiate diverse, semantically equivalent concrete tasks. The authors implement HumanEval_T, a leakage-resistant variant built from a subset of HumanEval tasks, and empirically assess it against the original HumanEval using four state-of-the-art LLMs. Results show evidence of data leakage when comparing HumanEval to HumanEval_T and reveal model- and variant-dependent consistency, highlighting the need for dynamic evaluation approaches. The proposed method enables evolving benchmarks that maintain task semantics while supporting fair longitudinal comparisons, with potential applicability beyond code generation to other SE benchmarking domains.

Abstract

The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common way to assess how well LLMs solve problem-specific tasks and to compare different versions of an LLM on the same tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from data leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the black-box nature of LLM training data, which makes it difficult to even know whether data leakage has occurred. To address the data leakage problem, we propose a new benchmark construction method where a benchmark is composed of template tasks that can be instantiated into new concrete tasks using combinatorial test design. Concrete tasks for the same template task must be different enough that data leakage has minimal impact and similar enough that the tasks are interchangeable with respect to performance evaluation. To assess our benchmark construction method, we propose HumanEval_T, an alternative benchmark to HumanEval that was constructed using template tasks and combinatorial test design.
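
The idea of instantiating a template task into several concrete tasks can be illustrated with a minimal sketch. It is not the authors' implementation: the template text, the factor names (`func_name`, `arg_name`, `quantity`), their levels, and the choice of greedy pairwise (2-way) coverage are all assumptions made for illustration. The sketch selects a small set of configurations in which every pair of factor levels appears at least once, then fills the template with each configuration.

```python
from itertools import combinations, product

# Hypothetical template task: each placeholder is one combinatorial factor.
TEMPLATE = (
    "def {func_name}({arg_name}):\n"
    '    """Return the {quantity} of the numbers in {arg_name}."""\n'
)

# Hypothetical factor levels, chosen so the task semantics stay the same.
FACTORS = {
    "func_name": ["total", "sum_values", "add_all"],
    "arg_name": ["numbers", "values", "items"],
    "quantity": ["sum", "total"],
}


def pairs_of(row, names):
    """Return every pair of (factor, level) choices covered by one configuration."""
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}


def pairwise_configs(factors):
    """Greedily pick configurations until every pair of levels is covered."""
    names = list(factors)
    candidates = [dict(zip(names, levels)) for levels in product(*factors.values())]
    uncovered = set().union(*(pairs_of(row, names) for row in candidates))
    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda row: len(pairs_of(row, names) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best, names)
    return chosen


def instantiate(template, config):
    """Fill the template placeholders with one configuration."""
    return template.format(**config)


configs = pairwise_configs(FACTORS)
concrete_tasks = [instantiate(TEMPLATE, c) for c in configs]

if __name__ == "__main__":
    print(f"{len(configs)} concrete tasks instead of 18 exhaustive combinations")
    print(concrete_tasks[0])
```

Pairwise coverage is only one possible strength of combinatorial design; the point of the sketch is that the concrete tasks differ in surface form while the underlying problem, and therefore the reference tests, stay the same.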

Paper Structure

This paper contains 16 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Benchmark Construction Approach Using Combinatorial Test Design
  • Figure 2: Performance Distribution Across HumanEval_T variants (V1-V5) with HumanEval
  • Figure 3: Performance Comparison of Models with HumanEval scores