A Data-Mining Based Study of Security Vulnerability Types and Their Mitigation in Different Languages
Gábor Antal, Balázs Mosolygó, Norbert Vándor, Péter Hegedüs
TL;DR
This study addresses how security vulnerability types and their remediation differ across programming languages by mining CVE/CWE signals from commit logs in nine languages. A data-mining pipeline built from CVE Manager, Git Log Parser, and CVE Miner collects CVEs mentioned in commits, estimates fix times, tracks contributor activity, and measures code changes to produce cross-language vulnerability statistics. Key findings reveal language-dependent CWE distributions and remediation patterns, such as CWE-119 in C++ and CWE-79 in Ruby, with larger projects often showing longer fix cycles and CVE reoccurrence, highlighting how ecosystem and scale shape security practices. The work demonstrates a scalable method to quantify cross-language security activity and provides a reference framework for researchers and developers, while acknowledging limitations due to sample size and reliance on commit-message indicators.
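The commit-mining step described above can be illustrated with a minimal sketch. It is not the authors' CVE Miner or Git Log Parser. It assumes that commits are available as (timestamp, message) pairs and that CVE identifiers follow the standard `CVE-YYYY-NNNN` form with four or more trailing digits. The names `extract_cves` and `mention_spans` are hypothetical, and the span between the first and last commit mentioning a CVE is used only as a crude stand-in for a fix-time estimate.

```python
import re
from datetime import datetime

# Standard CVE identifier syntax: "CVE-", a 4-digit year, then 4 or more digits.
# Matching is case-insensitive because commit messages are free-form text.
CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)


def extract_cves(message):
    """Return the sorted, de-duplicated CVE IDs mentioned in a commit message."""
    return sorted({f"CVE-{year}-{num}" for year, num in CVE_PATTERN.findall(message)})


def mention_spans(commits):
    """Map each CVE ID to the time between its first and last mentioning commit.

    `commits` is an iterable of (timestamp, message) pairs. The resulting span
    is an assumed proxy for fix time, not the paper's exact definition.
    """
    bounds = {}
    for timestamp, message in commits:
        for cve in extract_cves(message):
            first, last = bounds.get(cve, (timestamp, timestamp))
            bounds[cve] = (min(first, timestamp), max(last, timestamp))
    return {cve: last - first for cve, (first, last) in bounds.items()}


# Made-up commit data, for illustration only.
commits = [
    (datetime(2020, 1, 10), "Add regression test for CVE-2019-12345"),
    (datetime(2020, 1, 14), "Fix buffer overflow (cve-2019-12345)"),
    (datetime(2020, 3, 2), "Sanitize output, fixes CVE-2020-0001"),
]
spans = mention_spans(commits)
```

Relying on commit messages alone misses fixes that never name a CVE, which is the limitation the summary above already acknowledges.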
Abstract
The number of people accessing online services is increasing day by day, and with new users comes a greater need for effective and responsive cyber-security. Our goal in this study was to find out whether there are common patterns in security issues and their fixes across the most widely used programming languages. In this paper, we present statistics based on the data we extracted for these languages. Analyzing the more popular ones, we found that the same security issues might appear differently in different languages, and the provided solutions may vary just as much. We also found that projects of similar size can produce extremely different results and have different common weaknesses, even if they solve the same task. These statistics may not be entirely indicative of the projects' security standards, but they provide a good reference point for what one should expect. Given a larger sample size, they could be made more precise, yielding a better understanding of the security-relevant activities within projects written in a given language.
