A Survey on Evaluating Large Language Models in Code Generation Tasks

Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

TL;DR

This survey catalogs the metrics, benchmarks, and practices used to evaluate large language models in code generation. It organizes evaluation into similarity-based, execution-based, and human-centric methods, covering code-specific metrics such as CodeBLEU, benchmarks such as CodeXGLUE, and efficiency-focused frameworks such as EffiBench and Mercury. It highlights practical challenges, including gaps in programming-language coverage and shallow test cases, and outlines future directions such as multimodal, context-aware, ethical, and automated CI/CD-integrated evaluation, while advocating greater human-AI collaboration in assessment. The work provides a structured reference for researchers and practitioners to benchmark, compare, and improve LLM-driven code generation systems.

Abstract

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation. The paper begins by reviewing the historical development of LLMs and their applications in code generation. Next, it details various methods and metrics for assessing the code generation capabilities of LLMs, including code correctness, efficiency, readability, and evaluation methods based on expert review and user experience. The paper also evaluates the widely used benchmark datasets, identifying their limitations and proposing directions for future improvements. Specifically, the paper analyzes the performance of code generation models across different tasks by combining multiple evaluation metrics, such as code compilation/interpretation success rates, unit test pass rates, and performance and efficiency metrics, to comprehensively assess the practical application of LLMs in code generation. Finally, the paper discusses the challenges faced in evaluating LLMs in code generation, particularly how to ensure the comprehensiveness and accuracy of evaluation methods and how to adapt to the evolving practices of software development. These analyses and discussions provide valuable insights for further optimizing and improving the application of LLMs in code generation tasks.

Paper Structure

This paper contains 39 sections, 1 equation, and 6 figures.

Figures (6)

  • Figure 1: Classification of Evaluation Metrics for Code Generation Benchmarks.
  • Figure 2: Classification of Code Generation Benchmarks.
  • Figure 3: Pass@1 Performance of LLMs on HumanEval Over Time.
  • Figure 4: Pass@1 Performance of LLMs on MBPP Over Time.
  • Figure 5: Execution Time (ET) Evaluation of LLMs on EffiBench Over Release Time.
  • ...and 1 more figures
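
Figures 3 and 4 report Pass@1, a unit-test pass-rate metric of the kind the abstract mentions. The excerpt here does not reproduce the survey's own equation, so the following is only a minimal sketch, assuming the widely used unbiased pass@k estimator introduced alongside HumanEval: for n sampled solutions of which c pass all unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: number of solutions sampled for one problem
    c: number of those solutions that pass every unit test
    k: evaluation budget (k = 1 gives Pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of sampled solutions that pass, which is why Pass@1 is often read as a plain unit-test pass rate.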