JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu

TL;DR

This work proposes JavaBench, a project-level Java benchmark that exercises OOP features, and introduces a systematic evaluation design covering three context settings and five synthesis strategies at two granularities, using three hierarchical metrics to better evaluate LLMs' capabilities on JavaBench.

Abstract

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, programming languages are imbalanced: 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, code granularity is imbalanced: function-/statement-level benchmarks account for over 83.3% of benchmarks, only a handful extend to the class or project level, and all of those are limited to Python. Third, advanced features are lacking: existing benchmarks primarily assess basic coding skills while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench has been attested by 282 undergraduate students, who reached a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of the documentation, code skeleton, and tests. To better evaluate LLMs' capabilities against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiments yield several interesting findings. First, in project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLM, and Pass@5 reaches at most 41.17% in a more relaxed evaluation). Second, using method signatures as the prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.
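The abstract reports Pass@5 but this excerpt does not define it. As a rough illustration only, the sketch below assumes the commonly used unbiased pass@k estimator popularized alongside HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass. Whether JavaBench computes its metric exactly this way is an assumption; the class name `PassAtK` and method name `passAtK` are hypothetical.

```java
// Hypothetical helper: NOT taken from the JavaBench repository.
// Assumes the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k).
public final class PassAtK {

    /**
     * @param n number of samples generated for one task
     * @param c number of those samples that pass the tests
     * @param k sampling budget (e.g. 5 for Pass@5)
     * @return estimated probability that at least one of k samples passes
     */
    static double passAtK(int n, int c, int k) {
        if (n <= 0 || c < 0 || c > n || k <= 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= c <= n and 1 <= k <= n");
        }
        // Fewer than k failing samples: every k-subset contains a passing one.
        if (n - c < k) {
            return 1.0;
        }
        // C(n-c, k) / C(n, k) written as a product to avoid large factorials.
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    public static void main(String[] args) {
        // Example: 10 samples, 3 passing, budget of 5.
        System.out.printf("pass@5 = %.4f%n", passAtK(10, 3, 5));
    }
}
```

Averaging this value over all tasks gives a benchmark-level score; the product form is used instead of explicit binomial coefficients so that larger n does not overflow.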

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An Example of Project Skeleton in JavaBench
  • Figure 2: Generation Pipeline for a Java Project. Given a project to be completed, for each method with a TODO, there are three types of Context Settings (➊ $\sim$ ➌). On top of method completion, there are three Synthesis Strategies to complete an entire class.
  • Figure 3: Evaluation Design of Granularities and Metrics. To evaluate an LLM-generated project, two granularities (i.e., class-wise and test-wise) are adopted to replace the related classes to compile corresponding programs $P'_X$ where $X$ denotes a class (A-C) or a test (M-O). Then, three-fold evaluation metrics (i.e., completion, compilation, and pass) are applied to evaluate $P'_X$.
  • Figure 4: Number of Characters of Three Context Settings (i.e., Maximum/Minimum and Selected Context, Section \ref{sec:context}). Each color represents one project in JavaBench.
  • Figure 5: RQ3: Impact of Different Incremental Synthesis on DeepSeek-Coder-6.7b. Completion/Compilation/Pass@1 (Upper) and Completion/Compilation/Pass@5 (Lower).
  • ...and 1 more figures