OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software

Zhuoran Tan, Christos Anagnostopoulos, Jeremy Singer

TL;DR

This work addresses the gap in OSS security datasets by introducing OSPtrack, a labeled runtime dataset created through simulated execution of open-source packages across five ecosystems. Using a package-analysis sandbox, it collects eight runtime/static features and attaches attack-type sub-labels by matching against established malicious-package collections. The dataset comprises 9,461 reports (1,962 malicious; 7,499 benign) and supports cross-ecosystem runtime detection, model training, and comparative analysis, with data and code made public for reproducibility. It aims to strengthen OSS supply chain security by enabling runtime indicators to complement static analysis in detection efforts, while noting limitations and avenues for expansion.

Abstract

Open-source software serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.

Paper Structure

This paper contains 13 sections, 1 figure, 2 tables, and 1 algorithm.

Figures (1)

  • Figure 1: The Data Generation Framework. (1) Collect package information; (1.a) query analyzed results from BigQuery. (2) Simulate packages with package-analysis in multiple processes. (3) Parse the JSON reports and the queried Parquet reports, then extract features. (4) Match against known labels to generate labels.
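
Steps 3 and 4 of the framework in Figure 1 (parsing reports, extracting features, and matching against known labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The flat report layout, the field names (`ecosystem`, `name`, `version`, `files`, `sockets`, `commands`, `dns`), the helper names, and the exact-match rule on (ecosystem, name, version) are all assumptions made for illustration. Only four of the eight features are shown, since those are the ones named in the abstract.

```python
import json
from typing import Dict, List, Optional, Tuple

# Assumed feature fields; the abstract names files, sockets, commands, and DNS records.
# The real report schema may differ.
FEATURE_KEYS = ("files", "sockets", "commands", "dns")

# Hypothetical lookup key for a package: (ecosystem, name, version).
PackageKey = Tuple[str, str, str]


def extract_features(report: dict) -> Dict[str, int]:
    """Count the entries recorded for each feature in one parsed report."""
    return {key: len(report.get(key, [])) for key in FEATURE_KEYS}


def label_report(report: dict, known_malicious: Dict[PackageKey, str]) -> dict:
    """Attach a binary label and an attack-type sub-label by exact key match."""
    key: PackageKey = (report["ecosystem"], report["name"], report["version"])
    sub_label: Optional[str] = known_malicious.get(key)
    row = {"package": key, **extract_features(report)}
    row["label"] = "malicious" if sub_label is not None else "benign"
    row["sub_label"] = sub_label
    return row


def build_dataset(raw_reports: List[str],
                  known_malicious: Dict[PackageKey, str]) -> List[dict]:
    """Parse raw JSON report strings and return one labeled row per report."""
    return [label_report(json.loads(raw), known_malicious) for raw in raw_reports]
```

Under these assumptions, a report whose key appears in the known-malicious collection is labeled malicious and inherits that entry's attack-type sub-label. Every other report is labeled benign with no sub-label. A real pipeline would also need to handle the Parquet results queried from BigQuery and decide how to treat version mismatches.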