A Data-Mining Based Study of Security Vulnerability Types and Their Mitigation in Different Languages

Gábor Antal, Balázs Mosolygó, Norbert Vándor, Péter Hegedüs

TL;DR

This study addresses how security vulnerability types and their remediation differ across programming languages by mining CVE/CWE signals from commit logs in nine languages. A data-mining pipeline built from CVE Manager, Git Log Parser, and CVE Miner collects CVEs mentioned in commits, estimates fix times, tracks contributor activity, and measures code changes to produce cross-language vulnerability statistics. Key findings reveal language-dependent CWE distributions and remediation patterns, such as CWE-119 in C++ and CWE-79 in Ruby, with larger projects often showing longer fix cycles and CVE reoccurrence, highlighting how ecosystem and scale shape security practices. The work demonstrates a scalable method to quantify cross-language security activity and provides a reference framework for researchers and developers, while acknowledging limitations due to sample size and reliance on commit-message indicators.
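The commit-mining step described above can be illustrated with a minimal sketch. This is not the authors' CVE Miner; the function names `extract_cves` and `estimate_fix_days` are hypothetical. The sketch assumes that a CVE is "found" at the first commit whose message mentions it and "fixed" at the last such commit, which is only one plausible reading of how fix time could be estimated from commit-message indicators.

```python
import re
from collections import defaultdict
from datetime import datetime

# CVE identifiers look like CVE-YYYY-NNNN, with four or more digits in the last part.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(message):
    """Return the sorted, upper-cased, de-duplicated CVE IDs found in a commit message."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(message)})


def estimate_fix_days(commits):
    """Estimate the fix time per CVE from (iso_date, message) pairs.

    Assumption (not confirmed by the paper): the fix time is the number of
    whole days between the first and the last commit mentioning the CVE.
    """
    mentions = defaultdict(list)
    for iso_date, message in commits:
        when = datetime.fromisoformat(iso_date)
        for cve in extract_cves(message):
            mentions[cve].append(when)
    return {cve: (max(dates) - min(dates)).days for cve, dates in mentions.items()}


if __name__ == "__main__":
    # Made-up commit log used purely for illustration.
    log = [
        ("2020-01-01", "Report possible overflow, see CVE-2020-1234"),
        ("2020-01-05", "Refactor parser"),
        ("2020-01-11", "Fix cve-2020-1234 by bounds-checking input"),
    ]
    print(estimate_fix_days(log))
```

A real pipeline would read commit dates and messages from `git log` output and join the extracted IDs against CVE metadata (CWE type, base score); both steps are omitted here.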

Abstract

The number of people accessing online services is increasing day by day, and with new users comes a greater need for effective and responsive cyber-security. Our goal in this study was to find out whether the most widely used programming languages share common patterns in their security issues and fixes. In this paper, we present statistics based on the data we extracted for these languages. Analyzing the more popular ones, we found that the same security issues can appear differently in different languages, and the solutions provided can vary just as much. We also found that projects of similar size can produce very different results and have different common weaknesses, even if they solve the same task. These statistics may not be entirely indicative of the projects' security standards, but they provide a good reference point for what one should expect. Given a larger sample size, they could be made more precise, giving a better understanding of the security-relevant activities within projects written in a given language.

Paper Structure (19 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: A schematic representation of our miner
  • Figure 2: The average time elapsed in days between finding and fixing a CVE
  • Figure 3: The average time elapsed, in days, between the publication and fixing of a CVE
  • Figure 4: The correlation between the base score (severity) and the time taken to fix the CVE
  • Figure 5: The average number of contributors between the finding and fixing commit
  • ...and 1 more figures