CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation
Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao
TL;DR
This work tackles the challenge of evaluating LLM-driven code generation in complex scenarios, where existing methods struggle with multi-dimensional context and explainability. It introduces CodeVisionary, an agent-based two-stage framework consisting of Requirement-guided Multi-dimensional Context Distillation (RMCD) and Fine-grained Scoring And Summarization (FSAS), augmented by multi-judge negotiation to produce both a numeric score and a detailed report. On a CodeArena-derived benchmark (363 samples, 37 coding scenarios, 23 languages), CodeVisionary outperforms the baselines (VANILLA, ICE-SCORE, CODEJUDGE) on the Pearson, Spearman, and Kendall-Tau correlation coefficients, with the gains attributed to richer contextual information and collaborative scoring. The approach improves explainability and could be integrated into CI/CD pipelines, offering a practical path to rigorous, fine-grained evaluation of code-generation capabilities in modern LLMs. Overall, the results indicate that agent-based, context-rich evaluation paired with negotiated scoring yields more reliable and interpretable assessments of complex code generation tasks.
Abstract
Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Because human-centered approaches are labour-intensive and metric-based ones rely heavily on reference answers, LLM-based approaches are gaining increasing attention for their stronger contextual understanding capabilities. However, they generally evaluate the generated code with static prompts and tend to fail in complex code scenarios, which typically involve multiple requirements and demand more contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability. To mitigate these limitations, we propose CodeVisionary, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: (1) a requirement-guided multi-dimensional context distillation stage and (2) a fine-grained scoring and summarization stage. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark consisting of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that CodeVisionary outperforms three baselines in evaluating complex code generation, exceeding the best baseline by average margins of 0.217, 0.163, and 0.141 in the Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.
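To make the three reported coefficients concrete, the following is a minimal sketch of how agreement between an evaluator's scores and human scores could be measured. It is not taken from the CodeVisionary repository: the function names and the sample scores are hypothetical, and Kendall-Tau is computed in its tie-aware tau-b form, which is an assumption about the variant used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-b: concordant vs. discordant pairs, adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Hypothetical scores for four generated responses (illustration only).
human_scores = [1, 2, 3, 4]
evaluator_scores = [1, 3, 2, 4]

for name, fn in (("Pearson", pearson), ("Spearman", spearman), ("Kendall-Tau", kendall_tau)):
    print(f"{name}: {fn(human_scores, evaluator_scores):.3f}")
```

All three coefficients lie in [-1, 1], and higher values mean the evaluator's scores track human judgments more closely. Pearson is sensitive to the actual score values, while Spearman and Kendall-Tau depend only on how the responses are ordered.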
