CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

Cheng Cheng, Jinqiu Yang

TL;DR

CFCEval tackles the challenge of evaluating security aspects in code generated by large language models by introducing a bias-mitigating multi-language vulnerability benchmark (MLVBench) and a novel element-level relevance metric (ELRM). The framework defines four evaluation dimensions—Programming Language Quality, Fixing Capability, Post-Transformation Fixing Capability, and Element-Level Relevance—to capture correctness and security concerns beyond traditional CodeBLEU-based metrics. Empirical results show ELRM correlates more strongly with human and LLM judgments than CodeBLEU/BLEU, and the CFCEval setup with GPT-based prompts demonstrates potential as a comprehensive automatic benchmark. A pilot vulnerability-repair study confirms CFCEval's practicality and motivates expanding the benchmark across more languages and vulnerabilities.

Abstract

Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias by creating a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. CFCEval evaluates generated code across four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance. Our experiments show that CFCEval captures both the quality and security aspects of generated code more effectively than existing protocols, and that its ELRM aligns more closely with human judgments than CodeBLEU, paving the way for future advances in Code LLM evaluation.

Paper Structure

This paper contains 23 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the CFCEval framework, highlighting the dataset, the four evaluation dimensions, and the corresponding metrics used to assess the code quality and security of Code LLMs.
  • Figure 2: Distribution of Vulnerabilities by CWE Types
  • Figure 3: The application of the CFCEval framework in our experiments, including vulnerable functions, vulnerable code, secure reference code, and generated code, illustrating the evaluation process.
  • Figure 4: Example 1. CodeBLEU: 22.03, ELRM: 13.71.
  • Figure 5: Example 2. CodeBLEU: 62.87, ELRM: 94.95.