SQuaD: The Software Quality Dataset

Mikel Robredo; Matteo Esposito; Davide Taibi; Rafael Peñaloza; Valentina Lenarduzzi

TL;DR

The paper addresses the need for a comprehensive, time-aware, multi-dimensional software quality dataset spanning multiple ecosystems. It presents SQuaD, which integrates nine static analysis tools to extract 725 metrics from 450 mature projects and 63,586 releases, enriched with version-control and issue-tracking histories and CVE/CWE vulnerability data. The methodology comprises four mining stages (VCS, SQua metrics, vulnerabilities, process metrics) and yields a MongoDB- and CSV-based dataset with 14 release-process metrics and large-scale statistics. The dataset enables large-scale empirical studies on maintainability, technical debt, software evolution, and JIT defect prediction; the paper also outlines directions for automatic updates and cross-project quality modeling.

Abstract

Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at an unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling, to support the continuous evolution of software analytics. The dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.17566690).
