SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song

TL;DR

SeCodePLT offers a unified, scalable benchmark for evaluating code GenAI security across Python, C/C++, and Java by combining manually validated seeds with automated mutations and dynamic testing across 44 CWEs and three capabilities (secure coding, vulnerability detection, patch generation). The framework achieves high data quality and broad coverage, producing more than 5.9k samples that enable extensive evaluation of SOTA models and agents and reveal significant gaps in secure coding and vulnerability patching for current systems. Through comprehensive experiments and language-specific analyses, the paper demonstrates the ongoing limitations of general-purpose code LLMs in security-critical tasks and underscores the value of dynamic metrics over static judgments. It also outlines future directions, including expanding languages, risks, and testing modalities, and using SeCodePLT as a foundation for model fine-tuning to enhance security awareness in code GenAI.

Abstract

Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations: (1) limited coverage of risks and capabilities; (2) reliance on static evaluation metrics such as LLM judgments or rule-based detection, which lack the precision of dynamic analysis; and (3) a trade-off between data quality and benchmark scale. To address these challenges, we introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Our approach provides a full suite of artifacts so that the benchmark can support comprehensive risk assessment and security capability evaluation using dynamic metrics. By combining expert insights with automated generation, we strike a balance between manual effort, data quality, and benchmark scale. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities. Compared with state-of-the-art benchmarks, SeCodePLT offers broader coverage, higher data fidelity, and substantially greater scale. We use SeCodePLT to evaluate leading code LLMs and agents, revealing their strengths and weaknesses in both generating secure code and identifying or fixing vulnerabilities.

Paper Structure

This paper contains 41 sections, 16 figures, and 5 tables.

Figures (16)

  • Figure 1: Our two-stage data creation pipeline.
  • Figure 2: Risk categories.
  • Figure 3: An example of our data for three tasks. In this example, the code validates image dimensions for a Nikon raw image decoder. The original validation checks for zero dimensions and upper bounds, but misses a constraint that the width must be even for proper decompression. Without showing the decompress implementation as context, this requirement is not apparent from the main code alone.
  • Figure 4: SeCodePLT vs. CyberSecEval in security relevance and prompt faithfulness on Python-related CWEs. The numbers outside the circles are CWE numbers.
  • Figure 5: Secure coding rate of the selected models against SeCodePLT. We test two tasks: code completion (above) and instruction generation (below). We report the results using the pass@1 metric. The solid and hatched bars represent the ratios without and with a security reminder.
  • ...and 11 more figures
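
The caption of Figure 5 reports results with the pass@1 metric. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n is the number of samples generated for a task and c is the number that pass the tests. This summary does not show the paper's evaluation code, so the function names `pass_at_k` and `mean_pass_at_k` and the (n, c) input format are assumptions made for this example, not SeCodePLT's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed, k: sampling budget.
    (Hypothetical helper; not taken from the SeCodePLT codebase.)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of generated samples that pass.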