MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code

Moritz Mock, Jorge Melegati, Max Kretschmann, Nicolás E. Díaz Ferreyra, Barbara Russo

TL;DR

MADE-WIC tackles the problem of heterogeneous and biased annotations across public code datasets by fusing three existing datasets into a unified, richly annotated schema that covers vulnerabilities, technical debt, and security concerns. It combines multiple annotation approaches (vf, W, MAT, PS, SecI) and provides a fusion pipeline that extracts data from annotated datasets and open-source repositories, removes duplicates, and stores the results, including leading comments, in three CSV files. The work offers a platform for benchmarking tools, fine-tuning models such as CodeBERT, and studying how annotation choices affect detection performance, and it also reports data quality metrics and limitations. The dataset enables bias-controlled experimentation and cross-dataset comparisons, supporting the development of robust software-weakness detection tools for C/C++ code.

Abstract

In this paper, we present MADE-WIC, a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses leveraging different state-of-the-art approaches. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects. To the best of our knowledge, no such dataset is publicly available. MADE-WIC aims to provide researchers with a curated dataset on which to test and compare tools designed for the detection of code weaknesses and technical debt. As we have fused existing datasets, researchers have the possibility to evaluate the performance of their tools by also controlling the bias related to the annotation definition and dataset construction. The demonstration video can be retrieved at https://www.youtube.com/watch?v=GaQodPrcb6E.

Paper Structure

This paper contains 11 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Differences in the number and percentage of SATD instances between Ren et al. [Ren2019TOSEM] and Guo et al. [Guo2021] on the same set of comments [Maldonado2017].
  • Figure 2: Differences between the performance of the Ren et al. approach as reported by Guo et al. [Guo2021] and as reported in the original work of Ren et al. [Ren2019TOSEM].
  • Figure 3: Fusion approach to generate MADE-WIC, extracting the information from existing datasets and open source projects.
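
The fusion step summarized above (extract records, remove duplicates, store as CSV) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record fields (`function`, `leading_comment`), the source names, and the duplicate criterion (a hash of the whitespace-normalized function text) are all assumptions made for illustration, and the sketch writes a single CSV rather than the three files the dataset ships with.

```python
import csv
import hashlib
import io


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies compare equal (assumed criterion)."""
    return " ".join(code.split())


def fuse(sources):
    """Merge records from several sources, keeping the first copy of each function."""
    seen = set()
    fused = []
    for source_name, records in sources.items():
        for rec in records:
            key = hashlib.sha256(normalize(rec["function"]).encode("utf-8")).hexdigest()
            if key in seen:
                continue  # duplicate of a function already kept
            seen.add(key)
            fused.append({"source": source_name, **rec})
    return fused


def to_csv(rows, fieldnames):
    """Serialize the fused rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical toy inputs; the real datasets are far larger.
sources = {
    "dataset_a": [
        {"function": "int f(void) { return 0; }",
         "leading_comment": "/* TODO: handle errors */"},
    ],
    "dataset_b": [
        {"function": "int f(void)  {\n  return 0;\n}", "leading_comment": ""},
        {"function": "void g(char *s) { puts(s); }", "leading_comment": "// print s"},
    ],
}

rows = fuse(sources)
csv_text = to_csv(rows, ["source", "function", "leading_comment"])
```

Here the reformatted copy of `f` in `dataset_b` is dropped because it normalizes to the same text as the copy in `dataset_a`. A stricter or looser duplicate criterion (exact match, token-level comparison) would change which records survive, which is one of the construction choices the dataset is meant to make visible.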