MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code
Moritz Mock, Jorge Melegati, Max Kretschmann, Nicolás E. Díaz Ferreyra, Barbara Russo
TL;DR
MADE-WIC tackles the problem of heterogeneous and biased annotations across public code datasets by fusing three existing datasets into a unified, richly annotated schema that covers vulnerabilities, technical debt, and security concerns. It introduces multiple annotation approaches (vf, W, MAT, PS, SecI) and provides a practical fusion pipeline that extracts data from annotated datasets and open-source repositories, removes duplicates, and stores the functions, together with their leading comments, in three CSV files. The work offers a versatile platform for benchmarking tools, fine-tuning models such as CodeBERT, and studying how annotation choices affect detection performance, and it also reports data quality metrics and limitations. The dataset enables bias-controlled experimentation and cross-dataset comparisons, supporting the development of robust software-weakness detection tools for C/C++ ecosystems.
Abstract
In this paper, we present MADE-WIC, a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses, produced with different state-of-the-art approaches. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects. To the best of our knowledge, no such dataset is publicly available. MADE-WIC aims to provide researchers with a curated dataset on which to test and compare tools designed for the detection of code weaknesses and technical debt. Because we have fused existing datasets, researchers can evaluate the performance of their tools while also controlling for the bias related to the annotation definition and dataset construction. The demonstration video can be retrieved at https://www.youtube.com/watch?v=GaQodPrcb6E.
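To make the idea of fusing datasets while keeping several annotations per function concrete, the following is a minimal sketch in Python, not the pipeline used to build MADE-WIC. The record layout (the keys `function` and `labels`), the source names `dataset_a` and `dataset_b`, the whitespace-based duplicate criterion, and all label values are assumptions made for illustration; only the annotation names `vf` and `MAT` are taken from the text above.

```python
import hashlib
from collections import OrderedDict


def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences do not hide duplicates."""
    return " ".join(code.split())


def fuse(sources):
    """Merge records from several sources, keeping one row per distinct function.

    `sources` maps a (hypothetical) source name to a list of dicts with the
    (hypothetical) keys 'function' (source text) and 'labels'
    (annotation name -> 0/1).
    """
    fused = OrderedDict()
    for source_name, records in sources.items():
        for rec in records:
            digest = hashlib.sha256(
                normalize(rec["function"]).encode("utf-8")
            ).hexdigest()
            # The first occurrence defines the row; later duplicates only add
            # their provenance and their annotations to it.
            row = fused.setdefault(
                digest,
                {"function": rec["function"], "sources": [], "labels": {}},
            )
            row["sources"].append(source_name)
            row["labels"].update(rec["labels"])
    return list(fused.values())


# Toy input: the same function appears in both sources with different formatting.
sources = {
    "dataset_a": [
        {"function": "int f(int x) { return x + 1; }", "labels": {"vf": 0}},
        {"function": "void g(char *s) { strcpy(buf, s); }", "labels": {"vf": 1}},
    ],
    "dataset_b": [
        {"function": "int f(int x)  {\n  return x + 1;\n}", "labels": {"MAT": 0}},
    ],
}

rows = fuse(sources)
for row in rows:
    print(row["sources"], row["labels"])
```

Because each fused row records which sources contributed to it and carries every annotation side by side, a tool can be scored against one annotation at a time on the same set of functions, which is one simple way to separate the effect of the annotation definition from the effect of dataset construction.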
