ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation

Debalina Ghosh Paul, Hong Zhu, Ian Bayley

TL;DR

ScenEval addresses the problem of evaluating code-generation models across diverse usage scenarios by attaching rich metadata to each test case, so that scenario-specific test sets can be constructed through a datamorphic testing workflow implemented in Morphy. The benchmark is Java-focused and contains $12{,}864$ tasks sourced from textbooks, online tutorials, and Stack Overflow, each annotated with scenario metadata in JSON. The workflow uses seed makers, datamorphisms, metamorphisms, and test-set filters to generate and purify gamma ($\gamma$) and kappa ($\kappa$) test codes and to measure functional correctness with metrics such as $pass@1$ and the average pass rate. Experiments with ChatGPT show that performance declines as task complexity increases and is weakest on advanced topics; the generated code is typically shorter than the reference solution and, when correct, tends to be more complex. These findings underscore the value of scenario-based evaluation for diagnosing weaknesses and guiding improvements.
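As a rough illustration of the two metrics named above, the sketch below computes them from per-task test outcomes. It rests on two assumptions that are not confirmed by the text here: that exactly one solution is generated per task, so $pass@1$ reduces to the fraction of tasks whose solution passes every test, and that the average pass rate is the mean, over tasks, of the fraction of tests passed. The function names and the sample data are hypothetical, and the paper's exact definitions may differ.

```python
from typing import List, Tuple

# Each entry is (tests_passed, tests_total) for one task's single generated
# solution. The data layout is an assumption made for illustration only.
TaskResult = Tuple[int, int]


def pass_at_1(results: List[TaskResult]) -> float:
    """Fraction of tasks whose one generated solution passes all of its tests."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for passed, total in results if total > 0 and passed == total)
    return solved / len(results)


def average_pass_rate(results: List[TaskResult]) -> float:
    """Mean over tasks of the fraction of tests that the solution passes."""
    if not results:
        raise ValueError("results must not be empty")
    rates = [passed / total for passed, total in results if total > 0]
    if not rates:
        raise ValueError("no task has any tests")
    return sum(rates) / len(rates)


# Hypothetical outcomes for three tasks: fully correct, partly correct, wrong.
sample = [(5, 5), (3, 5), (0, 4)]
p1 = pass_at_1(sample)
apr = average_pass_rate(sample)
```

Under these assumptions the two metrics can diverge: a solution that passes most but not all tests contributes nothing to $pass@1$ yet still raises the average pass rate.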

Abstract

In the scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent various scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. A test system can then be constructed with test morphisms that filter the test cases based on their metadata to form a dataset. The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated, and the resulting test sets are used to evaluate ChatGPT for Java code generation. Our experiments found that the performance of ChatGPT decreases with the complexity of the coding task. It is weakest for advanced topics such as multi-threading, data structure algorithms and recursive methods. The Java code generated by ChatGPT tends to be much shorter than the reference solution in terms of number of lines. When the generated code is correct, it is more likely to be more complex than the reference solution in both cyclomatic and cognitive complexity metrics; when it is incorrect, it is more likely to be less complex.
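The idea of forming a scenario-specific test set by filtering on metadata can be sketched as follows. This is a minimal illustration, not the Morphy implementation: the JSON field names (`id`, `topic`, `source`, `ref_cyclomatic`), the sample records, and the `filter_tasks` helper are all hypothetical and do not reproduce ScenEval's actual schema.

```python
import json
from typing import Callable, Dict, List

# Hypothetical task records; the field names are illustrative assumptions,
# not the schema used by ScenEval.
TASKS_JSON = """
[
  {"id": "t1", "topic": "recursion", "source": "textbook", "ref_cyclomatic": 4},
  {"id": "t2", "topic": "strings", "source": "stackoverflow", "ref_cyclomatic": 2},
  {"id": "t3", "topic": "multi-threading", "source": "tutorial", "ref_cyclomatic": 7}
]
"""

Task = Dict[str, object]


def filter_tasks(tasks: List[Task], keep: Callable[[Task], bool]) -> List[Task]:
    """Return the sub-dataset of tasks whose metadata satisfies the predicate."""
    return [task for task in tasks if keep(task)]


tasks = json.loads(TASKS_JSON)

# Scenario 1: tasks on topics the abstract reports as difficult.
advanced = filter_tasks(tasks, lambda t: t["topic"] in {"recursion", "multi-threading"})

# Scenario 2: tasks whose reference solution has high cyclomatic complexity.
complex_tasks = filter_tasks(tasks, lambda t: t["ref_cyclomatic"] >= 5)

advanced_ids = [t["id"] for t in advanced]
complex_ids = [t["id"] for t in complex_tasks]
```

Keeping the scenario definition in a predicate over metadata means the benchmark itself stays fixed while new scenarios are added as new filters.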

Paper Structure (24 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Structure of JSON representation of Task
  • Figure 2: Structure of JSON representation of Description and Type
  • Figure 3: Structure of JSON Representation of Various Types of Sources
  • Figure 4: Structure of JSON Representation of Reference Solutions
  • Figure 5: Example of a Test Case
  • ...and 8 more figures