Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching
Zanis Ali Khan, Aayush Garg, Yuejun Guo, Qiang Tang
TL;DR
The paper addresses vulnerability-focused program repair, evaluating two pre-trained code models, CodeBERT and CodeT5, on nine datasets spanning five programming languages. It uses CodeBLEU and CrystalBLEU to measure patch accuracy and also analyzes computational efficiency and the impact of patch length. Results show that CodeT5 generally achieves higher patch accuracy, especially on datasets with complex vulnerability patterns, as well as faster inference, while CodeBERT performs better on fragmented or context-limited code; the performance of both models declines as patch length increases. The work provides benchmarks and practical guidance for deploying automated vulnerability patching at scale and discusses future directions such as hybrid architectures and improved long-context handling.
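CrystalBLEU, one of the two accuracy metrics named above, modifies BLEU by discarding "trivially shared" n-grams: the most frequent n-grams in a corpus of the target language (keywords, brackets, semicolons), which otherwise let unrelated code snippets score deceptively high. The following is a minimal, simplified sentence-level sketch assuming whitespace-tokenized code and a user-chosen cutoff `k`. The helper names `ngrams`, `trivially_shared_ngrams`, and `crystal_bleu` are hypothetical; this is neither the published CrystalBLEU tool (which works at corpus level) nor necessarily how the authors configured the metric.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def trivially_shared_ngrams(
    corpus: Iterable[Sequence[str]], k: int, max_n: int = 4
) -> Set[NGram]:
    """Return the k most frequent n-grams (n = 1..max_n) across a corpus."""
    totals: Counter = Counter()
    for tokens in corpus:
        for n in range(1, max_n + 1):
            totals.update(ngrams(tokens, n))
    return {gram for gram, _ in totals.most_common(k)}


def crystal_bleu(
    candidate: Sequence[str],
    reference: Sequence[str],
    ignore: Set[NGram],
    max_n: int = 4,
) -> float:
    """Sentence-level BLEU that skips n-grams listed in `ignore` (no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        # Drop trivially shared n-grams from both sides before matching.
        cand = Counter({g: c for g, c in ngrams(candidate, n).items() if g not in ignore})
        ref = Counter({g: c for g, c in ngrams(reference, n).items() if g not in ignore})
        total = sum(cand.values())
        # Clipped counts: a candidate n-gram matches at most as often as in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # an empty or zero precision makes unsmoothed BLEU zero
        log_precisions.append(math.log(clipped / total))
    # Standard BLEU brevity penalty on the full token lengths.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Illustrative usage with toy C-like snippets (hypothetical data).
reference = "if ( buf == NULL ) return - 1 ;".split()
candidate = "if ( buf == NULL ) return 0 ;".split()
ignored = trivially_shared_ngrams([reference, "if ( len > max ) return - 1 ;".split()], k=3)
print(crystal_bleu(candidate, reference, ignored))
```

Because the ignored set is derived once from a corpus, the same set should be reused for every candidate/reference pair that is being compared; otherwise scores are not comparable.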
Abstract
Software vulnerabilities pose critical security risks and demand prompt, effective mitigation. While advances in Automated Program Repair (APR) have primarily targeted general software bugs, vulnerability patching, a security-critical subset of APR, remains underexplored. This paper investigates the potential of two pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across diverse datasets in five programming languages. We evaluate the models on accuracy, computational efficiency, and sensitivity to the length of vulnerable code patches. Our findings reveal promising accuracy, particularly for CodeT5 on datasets with complex vulnerability patterns, while CodeBERT shows strengths on fragmented or context-limited datasets. CodeT5 is also more computationally efficient, making it well suited to large-scale applications. However, both models struggle to maintain performance as patch length increases, highlighting the difficulty of handling longer patches in program repair aimed specifically at fixing vulnerabilities. This study benchmarks model performance, identifies key limitations, and offers insights for improving automated vulnerability patching in practical security applications.
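The abstract reports that performance drops as patch length grows but does not say how length is measured or grouped. One simple way to surface such a trend, sketched here under assumptions not stated in the paper, is to bucket each test sample by the length of its reference patch and average a per-sample score (for example CodeBLEU) within each bucket. The function name `mean_score_by_length`, measuring length in tokens, and the bucket edges are hypothetical illustration choices, not the authors' protocol.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def mean_score_by_length(
    results: Sequence[Tuple[int, float]], bucket_edges: Sequence[int]
) -> Dict[str, float]:
    """Average per-patch scores within patch-length buckets.

    results: (patch_length, score) pairs, one per evaluated sample.
    bucket_edges: ascending inclusive upper bounds; longer patches fall into
    an open-ended final bucket.
    """
    if not bucket_edges:
        raise ValueError("bucket_edges must not be empty")
    buckets: Dict[str, List[float]] = defaultdict(list)
    for length, score in results:
        lower = 0
        for edge in bucket_edges:
            if length <= edge:
                label = f"{lower}-{edge}"
                break
            lower = edge + 1
        else:  # no edge was large enough: use the open-ended bucket
            label = f">{bucket_edges[-1]}"
        buckets[label].append(score)
    return {label: mean(scores) for label, scores in buckets.items()}
```

For example, calling it with edges `[50, 200]` yields buckets labeled `0-50`, `51-200`, and `>200`; a falling mean across those buckets would reflect the length effect the abstract describes.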
