Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization

Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang

TL;DR

SmartCoder-R1 addresses two linked problems in LLM-based smart-contract generation: reasoning that cannot be audited and code that is insecure. It applies a three-stage pipeline to Solidity: Continual Pre-Training, Long Chain-of-Thought Supervised Fine-Tuning, and Security-Aware Group Relative Policy Optimization. On a real-world benchmark, it achieves state-of-the-art compilability, security, and functional correctness, with a FullRate of 50.53%, and its reasoning is rated highly in human evaluation. The results suggest that explicit security reasoning combined with group-based reinforcement learning can yield more secure, auditable smart-contract code, with practical implications for on-chain safety and trust.

Abstract

Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. Code generation with Large Language Models (LLMs) amplifies this risk through two interconnected failures: the models operate as unauditable "black boxes" without a transparent reasoning process and, consequently, generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 (based on Qwen2.5-Coder-7B), a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the model. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 expert-validated reasoning-and-code samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 17 baselines on a benchmark of 756 real-world functions, SmartCoder-R1 establishes a new state of the art, achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also performs well in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
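The S-GRPO step described in the abstract combines a weighted reward with group-relative scoring. The following is a minimal sketch of that idea, not the authors' implementation. The weights, the `Checks` record, and the function names are all hypothetical, since the abstract names the three reward components but gives no values. The advantage computation follows the usual GRPO form, in which each sampled completion's reward is standardized against the other completions drawn for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical weights: the abstract says the reward is weighted
# but does not state the values.
WEIGHTS = {"compile": 0.4, "security": 0.4, "format": 0.2}


@dataclass
class Checks:
    """Outcome of the three checks for one generated completion (assumed binary)."""
    compiles: bool        # e.g. result of running a Solidity compiler
    secure: bool          # e.g. no finding from a vulnerability detector
    well_formatted: bool  # e.g. reasoning and code appear in the expected layout


def weighted_reward(c: Checks) -> float:
    """Weighted sum of the three binary signals."""
    return (WEIGHTS["compile"] * c.compiles
            + WEIGHTS["security"] * c.secure
            + WEIGHTS["format"] * c.well_formatted)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions of decreasing quality.
group = [
    Checks(True, True, True),
    Checks(True, False, True),
    Checks(False, False, True),
    Checks(False, False, False),
]
rewards = [weighted_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a completion is pushed up only when it beats its siblings for the same prompt. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.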

Paper Structure

This paper contains 22 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A motivating example to illustrate the limitations of non-reasoning Code LLMs in secure smart contract generation, demonstrating the advantages of reasoning-enhanced Code LLMs.
  • Figure 2: Overview of our SmartCoder-R1 pipeline.
  • Figure 3: Case study comparing the reasoning process and implementation order of two LLMs (SmartCoder-R1 and DeepSeek-R1) on the renounceOwnership function in smart contracts.