ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation

Debalina Ghosh Paul; Hong Zhu; Ian Bayley

TL;DR

ScenEval addresses the problem of evaluating code-generation models across diverse usage scenarios. It attaches rich metadata to each test case and supports scenario-focused dataset construction through a datamorphic testing workflow implemented in Morphy. The benchmark is Java-focused and contains $12{,}864$ tasks sourced from textbooks, online tutorials, and Stack Overflow, each annotated with scenario metadata in JSON. The workflow uses seed makers, datamorphisms, metamorphisms, and test-set filters to generate and purify gamma ($\gamma$) and kappa ($\kappa$) test codes, and it measures functional correctness with metrics such as $\mathrm{pass}@1$ and the average pass rate. Experiments with ChatGPT show that performance declines as task complexity and topic difficulty increase. Generated code is typically shorter than the reference solution, and when it is correct it tends to be more complex than the reference. These findings underscore the value of scenario-based evaluation for diagnosing weaknesses and guiding improvements.
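To make the two correctness metrics concrete, here is a minimal sketch. It assumes the common reading of each metric with one generated solution per task: pass@1 is the fraction of tasks whose solution passes every test, and the average pass rate is the mean per-task fraction of tests passed. The paper's exact definitions may differ, and `TaskResult` and its fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running one generated solution on one task (hypothetical record)."""
    task_id: str
    tests_passed: int
    tests_total: int  # assumed to be at least 1 for every task


def pass_at_1(results):
    """Fraction of tasks whose single generated solution passes all of its tests."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.tests_passed == r.tests_total)
    return solved / len(results)


def average_pass_rate(results):
    """Mean, over tasks, of the fraction of that task's tests that pass."""
    if not results:
        return 0.0
    return sum(r.tests_passed / r.tests_total for r in results) / len(results)


# Made-up outcomes for four tasks, purely for illustration.
results = [
    TaskResult("t1", 4, 4),
    TaskResult("t2", 1, 4),
    TaskResult("t3", 0, 2),
    TaskResult("t4", 2, 2),
]
print(pass_at_1(results))          # 2 of 4 tasks fully solved
print(average_pass_rate(results))  # mean of 1.0, 0.25, 0.0, 1.0
```

Under these assumptions the two metrics can diverge: a solution that passes most but not all tests contributes nothing to pass@1 yet still raises the average pass rate.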

Abstract

In the scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent various scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. A test system can then be constructed with test morphisms that filter the test cases based on metadata to form a dataset. The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated and the test sets are used to evaluate ChatGPT for Java code generation. Our experiments found that the performance of ChatGPT decreases with the complexity of the coding task. It is weakest for advanced topics like multi-threading, data structure algorithms and recursive methods. The Java code generated by ChatGPT tends to be much shorter than the reference solution in terms of the number of lines. If the generated code is correct, it is likely to be more complex than the reference solution in both cyclomatic and cognitive complexity metrics. If the generated code is incorrect, however, it is likely to be less complex than the reference solution.
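The idea of filtering a benchmark by metadata to form a scenario-specific test set can be sketched as follows. This is an illustration in plain Python, not the Morphy implementation described in the paper, and the `Task` class and the metadata keys `topic` and `source` are hypothetical rather than the actual ScenEval JSON schema.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A benchmark task with attached scenario metadata (hypothetical structure)."""
    task_id: str
    description: str
    metadata: dict = field(default_factory=dict)


def filter_by_scenario(tasks, **criteria):
    """Keep only tasks whose metadata matches every given key/value pair."""
    return [
        t for t in tasks
        if all(t.metadata.get(key) == value for key, value in criteria.items())
    ]


# Made-up tasks; the topics and sources echo categories named in the abstract.
benchmark = [
    Task("t1", "Compute a factorial recursively",
         {"topic": "recursion", "source": "textbook"}),
    Task("t2", "Start two threads that share a counter",
         {"topic": "multi-threading", "source": "stackoverflow"}),
    Task("t3", "Reverse a string",
         {"topic": "strings", "source": "textbook"}),
]

textbook_tasks = filter_by_scenario(benchmark, source="textbook")
recursion_tasks = filter_by_scenario(benchmark, topic="recursion", source="textbook")
print([t.task_id for t in textbook_tasks])
print([t.task_id for t in recursion_tasks])
```

Each filtered list can then be evaluated separately, which is what allows performance to be compared across scenarios such as topic or source.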

Paper Structure (24 sections, 13 figures, 5 tables)

This paper contains 24 sections, 13 figures, 5 tables.

Introduction
Related Work
Scenario-Based Testing and Evaluation of ML
Benchmarks for Code Generation
Evaluation of Code Generation Capability
ScenEval Benchmark
Structure of Data
Data Procurement and Extraction
Datamorphic Test System
Test Set Filters
Test Data Analysers
Analysers of data distributions
Analysers of test cases features
Seed Makers
Test Executer
...and 9 more sections

Figures (13)

Figure 1: Structure of JSON representation of Task
Figure 2: Structure of JSON representation of Description and Type
Figure 3: Structure of JSON Representation of Various Types of Sources
Figure 4: Structure of JSON Representation of Reference Solutions
Figure 5: Example of a Test Case
...and 8 more figures
