MVD: A Multi-Lingual Software Vulnerability Detection Framework
Boyu Zhang, Triet H. M. Le, M. Ali Babar
TL;DR
The paper tackles the challenge of detecting software vulnerabilities across multiple programming languages by introducing MVD, a CodeBERT-based framework that jointly learns vulnerability patterns from six languages and supports incremental learning to extend to new languages. MVD uses a multi-class training objective and a hybrid FOLA loss to handle language-imbalanced data, while an incremental learning module with distillation preserves prior knowledge when adding new languages. Experimental results on a curated dataset of over 11K real-world vulnerabilities show substantial improvements over single-language baselines and demonstrate effective extension to new languages without eroding performance on existing ones. The work advances practical vulnerability prediction for polyglot software ecosystems and provides data and code for future research.
Abstract
Software vulnerabilities can result in catastrophic cyberattacks that increasingly threaten business operations. Consequently, ensuring the safety of software systems has become a paramount concern for both private and public sectors. Recent literature has witnessed increasing exploration of learning-based approaches for software vulnerability detection. However, a key limitation of these techniques is their primary focus on a single programming language, such as C/C++, which poses constraints considering the polyglot nature of modern software projects. Further, there appears to be an oversight in harnessing the synergies of vulnerability knowledge across varied languages, potentially underutilizing the full capabilities of these methods. To address the aforementioned issues, we introduce MVD, an innovative multi-lingual vulnerability detection framework. This framework acquires the ability to detect vulnerabilities across multiple languages by concurrently learning from vulnerability data of various languages, which are curated by our specialized pipeline. We also incorporate incremental learning to enable the detection capability of MVD to be extended to new languages, thus augmenting its practical utility. Extensive experiments on our curated dataset of more than 11K real-world multi-lingual vulnerabilities substantiate that our framework significantly surpasses state-of-the-art methods in multi-lingual vulnerability detection by 83.7% to 193.6% in PR-AUC. The results also demonstrate that MVD detects vulnerabilities well for new languages without compromising the detection performance of previously trained languages, even when training data for the older languages is unavailable. Overall, our findings motivate and pave the way for the prediction of multi-lingual vulnerabilities in modern software systems.
