Training Language Models to Generate Quality Code with Program Analysis Feedback

Feng Yao, Zilong Wang, Liyuan Liu, Junxia Cui, Li Zhong, Xiaohan Fu, Haohui Mai, Vish Krishnan, Jianfeng Gao, Jingbo Shang

TL;DR

Generating production-grade code with LLMs remains challenging because the code must be secure and maintainable in addition to being functionally correct. REAL introduces a prompt-agnostic RL framework that couples program-analysis-based vulnerability detection with unit-test functionality checks to form a hybrid reward $r_{\text{hybrid}}$, which guides policy optimization via PPO. A dedicated vulnerability detector covering multiple CWEs, together with static analyzers such as MyPy, enables scalable, automated feedback across the SecCodePLT+, SafeSQL, and APPS+ benchmarks. Experimental results show that REAL consistently improves security, maintainability, and joint functionality-quality metrics across model scales, demonstrating a practical path to production-ready code with minimal human supervision.

Abstract

Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
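
As a rough illustration of how the two automated signals could be combined into one scalar reward, here is a minimal sketch. It assumes the unweighted average described in the Figure 1 caption. The `Feedback` container, the binary quality score, and the pass-fraction functionality score are hypothetical choices made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for the two automated signals."""
    defects_found: int  # defects flagged by program analysis
    tests_passed: int   # unit tests that passed
    tests_total: int    # unit tests that were run


def quality_reward(fb: Feedback) -> float:
    """1.0 if the analyzer flagged nothing, else 0.0 (binary scoring is an assumption)."""
    return 1.0 if fb.defects_found == 0 else 0.0


def functionality_reward(fb: Feedback) -> float:
    """Fraction of unit tests passed (an assumption); 0.0 when no tests exist."""
    if fb.tests_total == 0:
        return 0.0
    return fb.tests_passed / fb.tests_total


def hybrid_reward(fb: Feedback) -> float:
    """Unweighted mean of the quality and functionality signals."""
    return 0.5 * (quality_reward(fb) + functionality_reward(fb))
```

Under these assumptions, a program that passes every test but triggers an analyzer warning scores 0.5, so it is rewarded less than a program that is both correct and clean. The resulting scalar would then be the reward passed to the policy-gradient update.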

Paper Structure

This paper contains 33 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Overview of the ReaL framework. Given a coding task, the LLM policy generates a candidate program, which is then evaluated along two automated axes: (1) a Vulnerability Detector applies program analysis to flag security and maintainability defects, and (2) a Functionality Verifier runs unit tests to assess correctness. The two reward signals are averaged and fed into a policy-gradient update, steering the LLM toward high-quality, functionally correct code with minimal human effort.
  • Figure 2: Maintainability issues detected by MyPy.
  • Figure 3: Examples of code generated by ReaL$_{\text{0.5B}}$ at different training stages on SafeSQL with hybrid rewards. Initially, the model produces incorrect and insecure code, misinterpreting "or" as AND and directly incorporating unsanitized user inputs. Later, it adopts parameterized execution (? placeholders in the query with separate parameter binding), which implicitly addresses the vulnerabilities. Finally, it corrects the query logic and explicitly sanitizes user inputs by converting them with float($\cdot$).
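
The progression described in the Figure 3 caption can be illustrated with a small, self-contained sketch using Python's standard `sqlite3` module. The `items` table, the function names, and the sample rows are hypothetical and are not taken from SafeSQL. The sketch only contrasts string interpolation with `?` placeholders plus an explicit `float()` conversion.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [("pen", 1.5), ("book", 12.0), ("lamp", 30.0)],
    )
    return conn


def find_items_unsafe(conn: sqlite3.Connection, name: str, max_price: str) -> list[str]:
    """Vulnerable: user input is spliced directly into the SQL text."""
    query = (
        f"SELECT name FROM items WHERE name = '{name}' "
        f"OR price <= {max_price} ORDER BY name"
    )
    return [row[0] for row in conn.execute(query)]


def find_items_safe(conn: sqlite3.Connection, name: str, max_price: str) -> list[str]:
    """Safer: inputs are bound as parameters and the price is type-checked."""
    price = float(max_price)  # raises ValueError for non-numeric input
    query = "SELECT name FROM items WHERE name = ? OR price <= ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (name, price))]
```

With the injected name `x' OR '1'='1`, the unsafe version returns every row, while the safe version treats the whole string as a literal name and matches nothing.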