CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models
Cheng Cheng, Jinqiu Yang
TL;DR
CFCEval tackles the challenge of evaluating security aspects in code generated by large language models by introducing a bias-mitigating multi-language vulnerability benchmark (MLVBench) and a novel element-level relevance metric (ELRM). The framework defines four evaluation dimensions—Programming Language Quality, Fixing Capability, Post-Transformation Fixing Capability, and Element-Level Relevance—to capture correctness and security concerns beyond traditional CodeBLEU-based metrics. Empirical results show ELRM correlates more strongly with human and LLM judgments than CodeBLEU/BLEU, and the CFCEval setup with GPT-based prompts demonstrates potential as a comprehensive automatic benchmark. A pilot vulnerability-repair study confirms CFCEval's practicality and motivates expanding the benchmark across more languages and vulnerabilities.
Abstract
Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias by creating a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. CFCEval evaluates generated code across four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance. Our experiments show that CFCEval captures both the quality and security aspects of generated code more effectively, and that ELRM aligns more closely with human judgments than CodeBLEU does, paving the way for future advances in Code LLM evaluation.
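As an illustration of what "aligns more closely with human judgments" can mean in practice, one common approach is to compute a rank correlation between a metric's scores and human ratings over the same set of generated snippets. The sketch below is a minimal, dependency-free Spearman correlation; it is not the paper's evaluation code, and the names `metric_a` and `metric_b` and all scores are made-up placeholders rather than reported results.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings and two automatic metrics
# scored on the same five generated snippets (values are invented).
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b = [0.5, 0.4, 0.6, 0.3, 0.7]

print("metric_a vs human:", round(spearman(human, metric_a), 3))
print("metric_b vs human:", round(spearman(human, metric_b), 3))
```

Under this kind of comparison, the metric with the higher correlation ranks the snippets more like the human raters do. Whether the paper uses Spearman, Pearson, or another agreement measure is not stated in this section.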
