PurpCode: Reasoning for Safer Code Generation
Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang
TL;DR
Problem: Large language models used for code generation can be exploited to produce malicious code or insecure implementations.
Approach: PurpCode aligns models in two stages. Rule Learning applies supervised fine-tuning on a corpus of cybersafety rules and secure coding practices. Reinforcement Learning then uses diverse, multi-objective rewards to balance safety and utility. Internal red-teaming generates comprehensive unsafe prompts to expose failure modes.
Contributions: The first open-source cybersafety reasoning training recipe, including training infrastructure, datasets, synthesizers, and evaluators. PurpCode-32B achieves state-of-the-art cybersafety and reduces overrefusal while preserving utility in code generation and security knowledge.
Significance: Enables safer code generation in practice and provides a reproducible framework for cybersafety alignment.
Abstract
We introduce PurpCode, the first post-training recipe for training safe code reasoning models to generate secure code and defend against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules so that it generates vulnerability-free code and avoids facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety while preserving model utility through diverse, multi-objective reward mechanisms. To supply the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize high-coverage prompts, grounded in real-world tasks, that induce unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, PurpCode-32B, which demonstrates state-of-the-art cybersafety and outperforms various frontier models. Meanwhile, our alignment method reduces the model's overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.
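The abstract does not specify how the multiple reward signals are combined. As a purely illustrative sketch, and not PurpCode's actual reward design, one simple way to make a single scalar reward reflect both safety and utility is to gate it on every applicable objective. In this sketch, refusals are rewarded only for malicious requests, and answers to benign coding requests are rewarded only when the code is free of detected vulnerabilities and passes its tests. The `Verdicts` fields and the `combined_reward` function are hypothetical names. Each verdict is assumed to come from some external checker, such as a safety judge, a static analyzer, or a unit-test harness.

```python
from dataclasses import dataclass


@dataclass
class Verdicts:
    # Hypothetical per-sample signals from external checkers (illustrative only).
    is_malicious_request: bool  # the prompt asks for a malicious cyberactivity
    refused: bool               # the model declined to help
    code_is_secure: bool        # e.g. an analyzer reported no vulnerabilities
    tests_passed: bool          # the generated code passed its unit tests


def combined_reward(v: Verdicts) -> float:
    """Gate-style multi-objective reward: every applicable objective must hold."""
    if v.is_malicious_request:
        # Safety: for malicious requests, only a refusal is rewarded.
        return 1.0 if v.refused else 0.0
    if v.refused:
        # Utility: refusing a benign request is treated as overrefusal.
        return 0.0
    # Benign request answered: require code that is both secure and correct.
    return 1.0 if (v.code_is_secure and v.tests_passed) else 0.0


if __name__ == "__main__":
    # A benign request answered with secure, passing code earns full reward.
    print(combined_reward(Verdicts(False, False, True, True)))
```

Gating the objectives, rather than summing weighted scores, keeps the policy from trading a vulnerability for a passing test, at the cost of a sparser reward signal.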
