OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
Zhuoran Tan, Christos Anagnostopoulos, Jeremy Singer
TL;DR
This work addresses the shortage of runtime-focused security datasets for open-source software (OSS) by introducing OSPtrack, a labeled runtime dataset created through simulated execution of open-source packages across five ecosystems. Using a package-analysis sandbox, it collects eight runtime and static features and attaches attack-type sub-labels by matching against established malicious-package collections. The dataset comprises 9,461 reports (1,962 malicious; 7,499 benign) and supports cross-ecosystem runtime detection, model training, and comparative analysis, with data and code made public for reproducibility. It aims to strengthen OSS supply chain security by enabling runtime indicators to complement static analysis in detection efforts, and it notes limitations and avenues for expansion.
Abstract
Open-source software (OSS) serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.
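The reported counts imply a class imbalance of roughly one malicious report to four benign ones, which matters for model training. The following is a minimal sketch, not part of the dataset's released code, that derives the malicious share and one common set of per-class weights from the figures above. The function name `class_summary` is hypothetical, and the inverse-frequency weighting (total / (number of classes × class count)) is an assumption chosen for illustration; it is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
# Counts taken from the abstract; everything else here is illustrative.
TOTAL_REPORTS = 9461
MALICIOUS = 1962
BENIGN = 7499


def class_summary(malicious: int, benign: int):
    """Return (total, malicious share, inverse-frequency class weights)."""
    total = malicious + benign
    share = malicious / total
    # Assumed weighting scheme: total / (n_classes * class_count).
    weights = {
        "malicious": total / (2 * malicious),
        "benign": total / (2 * benign),
    }
    return total, share, weights


total, malicious_share, weights = class_summary(MALICIOUS, BENIGN)

# Sanity check: the two classes account for every report.
assert total == TOTAL_REPORTS

print(f"malicious share: {malicious_share:.1%}")
print(f"class weights: {weights}")
```

Under this scheme a malicious report counts for a little under four times as much as a benign one during training; other rebalancing strategies (resampling, threshold tuning) would be equally reasonable choices.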
