SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song
TL;DR
SeCodePLT is a unified, scalable benchmark for evaluating the security of code GenAI across Python, C/C++, and Java. It combines manually validated seed examples with automated mutations and dynamic testing, covering 44 CWEs and three capabilities (secure coding, vulnerability detection, and patch generation). The resulting dataset contains more than 5.9k samples and supports evaluation of state-of-the-art models and agents, which reveals significant gaps in secure coding and vulnerability patching. The experiments and language-specific analyses show that general-purpose code LLMs remain limited on security-critical tasks and that dynamic metrics are more reliable than static judgments. The paper also outlines future directions: expanding the covered languages, risks, and testing modalities, and using SeCodePLT as a foundation for fine-tuning models to improve the security awareness of code GenAI.
Abstract
Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations: (1) limited coverage of risks and capabilities; (2) reliance on static evaluation metrics such as LLM judgments or rule-based detection, which lack the precision of dynamic analysis; and (3) a trade-off between data quality and benchmark scale. To address these challenges, we introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Our approach provides a full suite of artifacts for each sample, so the benchmark can support both risk assessment and security capability evaluation using dynamic metrics. By combining expert insights with automated generation, we strike a balance between manual effort, data quality, and benchmark scale. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities. Compared with state-of-the-art benchmarks, SeCodePLT offers broader coverage, higher data fidelity, and substantially greater scale. We use SeCodePLT to evaluate leading code LLMs and agents, revealing their strengths and weaknesses in both generating secure code and identifying or fixing vulnerabilities.
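
To make the idea of a dynamic metric concrete, the following is a minimal sketch of how a candidate solution might be executed against a functionality test and a security test for a path traversal weakness (CWE-22). It does not reproduce SeCodePLT's actual harness, seeds, or mutators: the function names (`candidate_insecure`, `candidate_secure`, `evaluate`), the task, and the input list are all hypothetical, and the hand-written input variants only stand in for the automated mutations described above.

```python
import os
import tempfile

# Hypothetical candidate 1: joins paths without checking the result.
def candidate_insecure(base_dir, user_path):
    return os.path.join(base_dir, user_path)

# Hypothetical candidate 2: rejects paths that resolve outside base_dir.
def candidate_secure(base_dir, user_path):
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError("path escapes base directory")
    return target

# Illustrative attack inputs; hand-written stand-ins for mutated test cases.
ATTACK_INPUTS = [
    os.path.join("..", "secret.txt"),
    os.path.join("docs", "..", "..", "secret.txt"),
]

def functionality_test(resolve, base_dir):
    """A benign request must resolve to the expected file inside base_dir."""
    benign = os.path.join("docs", "a.txt")
    expected = os.path.realpath(os.path.join(base_dir, benign))
    return os.path.realpath(resolve(base_dir, benign)) == expected

def security_test(resolve, base_dir):
    """Every attack input must be rejected or stay inside base_dir."""
    base = os.path.realpath(base_dir)
    for payload in ATTACK_INPUTS:
        try:
            result = resolve(base_dir, payload)
        except ValueError:
            continue  # rejecting the input counts as secure behavior
        if os.path.commonpath([base, os.path.realpath(result)]) != base:
            return False  # the returned path escaped the sandbox
    return True

def evaluate(resolve):
    """Run both tests in a fresh temporary directory and report outcomes."""
    with tempfile.TemporaryDirectory() as base_dir:
        return {
            "functional": functionality_test(resolve, base_dir),
            "secure": security_test(resolve, base_dir),
        }

if __name__ == "__main__":
    print("insecure:", evaluate(candidate_insecure))
    print("secure:  ", evaluate(candidate_secure))
```

In this sketch both candidates pass the functionality test, and only the second passes the security test. A dynamic metric can separate the two because it observes what the code does at run time, which a rule-based or LLM-judged static check may miss.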
