Table of Contents
Fetching ...

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

TL;DR

The RACE benchmark is proposed, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency, and highlights the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications.

Abstract

In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We analyze 28 representative LLMs based on RACE and find that: 1) current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios, while RACE provides a comprehensive evaluation that reveals the defects of LLMs across multiple dimensions; 2) the RACE benchmark serves as an effective tool for resisting the risk of data contamination; 3) even the most advanced code LLMs still encounter significant challenges in customized requirements involving complex instructions; 4) most LLMs exhibit an inherent preference for specific coding style. These findings highlight the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications. Future efforts should aim to develop novel learning algorithms to enhance code generation under varied constraints and improve coverage and usability for diverse user needs.

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

TL;DR

The RACE benchmark is proposed, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency, and highlights the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications.

Abstract

In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We analyze 28 representative LLMs based on RACE and find that: 1) current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios, while RACE provides a comprehensive evaluation that reveals the defects of LLMs across multiple dimensions; 2) the RACE benchmark serves as an effective tool for resisting the risk of data contamination; 3) even the most advanced code LLMs still encounter significant challenges in customized requirements involving complex instructions; 4) most LLMs exhibit an inherent preference for specific coding style. These findings highlight the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications. Future efforts should aim to develop novel learning algorithms to enhance code generation under varied constraints and improve coverage and usability for diverse user needs.
Paper Structure (25 sections, 2 equations, 10 figures, 7 tables)

This paper contains 25 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Current benchmarks perform single-dimension evaluations and mostly focus only on code correctness (upper right); our proposed RACE benchmark performs multi-dimensional code evaluations to identify truly high-quality code beyond correctness (lower right).
  • Figure 2: The overall evaluation pipeline in RACE benchmark.
  • Figure 3: Performance radar charts of several representative LLMs on the RACE benchmark, visually illustrating the capability distribution across 8 detailed factors.
  • Figure 4: The variation in instruction-following capabilities under different factors in the context of data contamination.
  • Figure 5: The variation in the ability of different LLMs to generate code that is correct and satisfies all requirements as the number of requirements in complex instructions gradually increases from 2 (2-req) to 5 (5-req).
  • ...and 5 more figures