Learning to Generate Secure Code via Token-Level Rewards

Jiazheng Quan, Xiaodong Li, Bin Wang, Guo An, Like Liu, Degen Huang, Lin Liu, Chengbin Hou

TL;DR

This paper proposes Vul2Safe, a new secure code generation framework that leverages LLM self-reflection to construct high-confidence repair pairs from real-world vulnerabilities, and introduces SRCode, a novel training framework that pioneers the use of token-level rewards in reinforcement learning for code security.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in code generation, yet they remain prone to producing security vulnerabilities. Existing approaches commonly suffer from two key limitations: the scarcity of high-quality security data and coarse-grained reinforcement learning reward signals. To address these challenges, we propose Vul2Safe, a new secure code generation framework that leverages LLM self-reflection to construct high-confidence repair pairs from real-world vulnerabilities, and further generates diverse implicit prompts to build the PrimeVul+ dataset. Meanwhile, we introduce SRCode, a novel training framework that pioneers the use of token-level rewards in reinforcement learning for code security, which enables the model to continuously attend to and reinforce critical fine-grained security patterns during training. Compared with traditional instance-level reward schemes, our approach allows for more precise optimization of local security implementations. Extensive experiments show that PrimeVul+ and SRCode substantially reduce security vulnerabilities in generated code while improving overall code quality across multiple benchmarks.
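
The contrast the abstract draws between instance-level and token-level rewards can be illustrated with a minimal sketch. This is not the SRCode implementation: the function names, the `critical` index set, and the `bonus` value are hypothetical, and the loss is a plain REINFORCE-style surrogate chosen only for illustration.

```python
from typing import List, Set


def instance_level_weights(num_tokens: int, reward: float) -> List[float]:
    """Instance-level scheme: every token shares one scalar reward."""
    return [reward] * num_tokens


def token_level_weights(
    num_tokens: int, base_reward: float, critical: Set[int], bonus: float
) -> List[float]:
    """Token-level scheme (hypothetical): tokens whose positions appear in
    `critical` (assumed to mark security-relevant code) get an extra bonus."""
    return [
        base_reward + (bonus if i in critical else 0.0)
        for i in range(num_tokens)
    ]


def policy_gradient_loss(log_probs: List[float], weights: List[float]) -> float:
    """REINFORCE-style surrogate: negative reward-weighted mean log-probability."""
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


if __name__ == "__main__":
    # Illustrative per-token log-probabilities for a four-token completion.
    log_probs = [-1.0, -2.0, -0.5, -1.0]
    coarse = instance_level_weights(len(log_probs), reward=0.5)
    fine = token_level_weights(len(log_probs), 0.5, critical={2}, bonus=1.0)
    print(policy_gradient_loss(log_probs, coarse))
    print(policy_gradient_loss(log_probs, fine))
```

Under the instance-level scheme the gradient signal is spread evenly over the whole sample, whereas the token-level weights concentrate it on the positions assumed to implement the security-relevant logic.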

Paper Structure

This paper contains 24 sections, 13 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Our Methodology. (1) Vul2Safe transforms real-world vulnerable code into high-quality secure repair data with diverse implicit prompts. (2) PrimeVul+ ranks the samples based on four metrics and classifies them into three progressively difficult, curriculum-style tasks. (3) SRCode first applies SFT on detection and repair tasks, and then performs RL on secure code generation with token-level rewards (TLR) for fine-grained safety optimization.
  • Figure 2: Comparison of Sampling Efficiency Gaps ($\Delta_{SE}$). A smaller $\Delta_{SE}$ indicates that the RL-trained model is closer to the theoretical performance upper bound of the base model under large-k sampling.
  • Figure 3: Vulnerability distribution on CodeLMSec across different methods. The numbers of High, Medium, and Low severity vulnerabilities are reported, representing high-risk, medium-risk, and low-risk security issues, respectively. Each subplot corresponds to a specific base model. Lower vulnerability counts indicate safer model behavior.