Training Language Models to Generate Quality Code with Program Analysis Feedback

Feng Yao; Zilong Wang; Liyuan Liu; Junxia Cui; Li Zhong; Xiaohan Fu; Haohui Mai; Vish Krishnan; Jianfeng Gao; Jingbo Shang

TL;DR

Generating production-grade code with LLMs remains challenging because the output must be secure and maintainable as well as functionally correct. REAL is a prompt-agnostic RL framework that couples program-analysis-based vulnerability detectors with unit-test functionality checks to form a hybrid reward $r_{hybrid}$, which guides policy optimization via PPO. A dedicated vulnerability detector covering multiple CWEs, together with static analyzers such as MyPy, enables scalable, automated feedback on the SecCodePLT+, SafeSQL, and APPS+ benchmarks. Experiments show that REAL consistently improves security, maintainability, and joint functionality-quality metrics across model scales, pointing to a practical path toward production-ready code with minimal human supervision.

Abstract

Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
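
To make the two automated signals concrete, here is a minimal Python sketch of how a program-analysis score and a unit-test score might be combined into a single scalar reward. It illustrates the idea only and is not the authors' implementation: the `Feedback` container, the binary quality score, the pass-rate functionality score, and the mixing weight `alpha` are all assumptions introduced for this example, since the text above says only that the two signals form a hybrid reward.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical container for the automated signals on one generated program."""
    tests_passed: int    # unit tests that passed
    tests_total: int     # unit tests that were run
    defects_found: int   # issues flagged by a program-analysis tool


def functionality_reward(fb: Feedback) -> float:
    """Fraction of unit tests passed (assumed form; 0.0 if no tests exist)."""
    if fb.tests_total == 0:
        return 0.0
    return fb.tests_passed / fb.tests_total


def quality_reward(fb: Feedback) -> float:
    """1.0 if the analyzer reports no defects, else 0.0 (assumed binary form)."""
    return 1.0 if fb.defects_found == 0 else 0.0


def hybrid_reward(fb: Feedback, alpha: float = 0.5) -> float:
    """Convex combination of the two signals; alpha is an assumed weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * quality_reward(fb) + (1.0 - alpha) * functionality_reward(fb)


if __name__ == "__main__":
    clean_but_incomplete = Feedback(tests_passed=3, tests_total=4, defects_found=0)
    correct_but_flawed = Feedback(tests_passed=4, tests_total=4, defects_found=2)
    print(hybrid_reward(clean_but_incomplete))  # 0.875
    print(hybrid_reward(correct_but_flawed))    # 0.5
```

Neither signal references a ground-truth solution or depends on the prompt wording, which matches the reference-free, prompt-agnostic framing above. In an RL setup, a scalar like this would be the reward assigned to each sampled program.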
