CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao

TL;DR

This work addresses the evaluation of LLM-generated code in complex scenarios, where existing methods lack sufficient contextual information and explainability. It introduces CodeVisionary, an agent-based two-stage framework consisting of Requirement-guided Multi-dimensional Context Distillation (RMCD) and Fine-grained Scoring And Summarization (FSAS), in which multiple judges negotiate to produce both a numeric score and a detailed evaluation report. On a CodeArena-derived benchmark (363 samples, 37 coding scenarios, 23 programming languages), CodeVisionary outperforms three baselines (VANILLA, ICE-SCORE, CODEJUDGE), improving on the best baseline by 0.217, 0.163, and 0.141 on average in the Pearson, Spearman, and Kendall-Tau coefficients, respectively. The gains are attributed to richer contextual information and collaborative scoring. The generated reports improve explainability, and the framework has potential for integration into CI/CD pipelines as a practical route to rigorous, fine-grained evaluation of code generation.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Because human-centered approaches are labour-intensive and metric-based ones rely heavily on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities. However, they generally evaluate the generated code with static prompts and tend to fail on complex code scenarios, which typically involve multiple requirements and require more contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability. To mitigate these limitations, we propose CodeVisionary, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: (1) a requirement-guided multi-dimensional context distillation stage and (2) a fine-grained scoring and summarization stage. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark consisting of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that CodeVisionary outperforms three baselines for evaluating complex code generation, exceeding the best baseline by average improvements of 0.217, 0.163, and 0.141 in the Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.
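The three coefficients named in the abstract can be illustrated with a minimal, standard-library-only sketch. It assumes the usual setup in which an evaluator's scores are compared against reference ratings for the same samples. The score lists below are made-up illustrative values, not data from the paper, and the choice of the tau-b variant for Kendall-Tau is an assumption, since the abstract does not say which variant is used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks from 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b (assumed variant), counting concordant/discordant pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


# Illustrative values only (hypothetical, not taken from the benchmark).
reference_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1, 3, 2, 4, 5]

print(pearson(reference_scores, evaluator_scores))
print(spearman(reference_scores, evaluator_scores))
print(kendall_tau_b(reference_scores, evaluator_scores))
```

Pearson measures linear agreement between the raw scores, while Spearman and Kendall-Tau depend only on how the two score lists order the samples, which is why evaluation papers commonly report all three.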

Paper Structure

This paper contains 32 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Examples for illustrating the limitations of LLM-based approaches for evaluating code generation. Each example includes the code generation task, the generated code, and the evaluation of ICE-SCORE. ICE-SCORE ratings range from 0 to 4, with higher scores indicating higher quality.
  • Figure 2: The architecture of CodeVisionary. It consists of two main stages: (a) Requirement-guided multi-dimensional context distillation stage for collecting contextual information based on the stepwise evaluation plan, and (b) Fine-grained scoring and summarization stage for generating evaluation scores and reports through negotiation with diverse viewpoints.
  • Figure 3: The influence of the number of judges and maximum number of rounds on CodeVisionary. The horizontal axis represents the number of judges or the maximum number of rounds.
  • Figure 4: Performance of CodeVisionary and baseline methods across different coding scenarios and programming languages, measured by $r_s$.
  • Figure 5: Example evaluation report generated by CodeVisionary.
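
The captions for Figures 2 and 3 mention negotiation among several judges, bounded by a maximum number of rounds. The toy sketch below only illustrates that control flow. It is not the paper's procedure: in CodeVisionary the judges are LLM agents, whereas here each hypothetical judge is reduced to a number that moves toward the group mean each round. The function name `negotiate` and the parameters `tolerance` and `step` are assumptions introduced purely for illustration.

```python
from statistics import mean


def negotiate(initial_scores, max_rounds=3, tolerance=0.5, step=0.5):
    """Toy negotiation loop (hypothetical stand-in for LLM judges).

    Each round, every judge moves its score a fraction `step` of the way
    toward the current group mean. The loop stops once the spread between
    the highest and lowest score is within `tolerance`, or after
    `max_rounds` rounds. Returns (final_score, rounds_used).
    """
    scores = list(initial_scores)
    rounds_used = 0
    for _ in range(max_rounds):
        if max(scores) - min(scores) <= tolerance:
            break  # judges already agree closely enough
        group_mean = mean(scores)
        scores = [s + step * (group_mean - s) for s in scores]
        rounds_used += 1
    return mean(scores), rounds_used


# Three hypothetical judges starting from different scores.
final_score, rounds_used = negotiate([2.0, 4.0, 3.0], max_rounds=3)
print(final_score, rounds_used)
```

Varying the length of `initial_scores` and the value of `max_rounds` in this toy corresponds loosely to the two quantities on the horizontal axis of Figure 3.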