QUT-DV25: A Dataset for Dynamic Analysis of Next-Gen Software Supply Chain Attacks

Sk Tanzir Mehedi, Raja Jurdak, Chadni Islam, Gowri Ramachandran

TL;DR

The paper addresses the lack of dynamic install-time and post-install-time data for detecting next-gen software supply chain attacks in PyPI. It introduces QUT-DV25, a dynamic analysis dataset gathered in isolated sandboxes with eBPF probes, capturing 36 features across six trace types for 14,271 Python packages (7,127 malicious). It shows that combining diverse eBPF-derived traces yields state-of-the-art detection performance (e.g., a random forest (RF) on CombinedTraces reaches around 95–96% accuracy) and that dynamic traces outperform traditional metadata and static baselines. The dataset, released with code and instructions, provides a practical benchmark for developing robust, scalable defenses against evolving OSS supply chain threats. The authors acknowledge limitations such as Linux-centric tracing and ecosystem specificity.

Abstract

Securing software supply chains is a growing challenge due to the inadequacy of existing datasets in capturing the complexity of next-gen attacks, such as multiphase malware execution, remote access activation, and dynamic payload generation. Existing datasets, which rely on metadata inspection and static code analysis, are inadequate for detecting such attacks. This creates a critical gap because these datasets do not capture what happens during and after a package is installed. To address this gap, we present QUT-DV25, a dynamic analysis dataset specifically designed to support and advance research on detecting and mitigating supply chain attacks within the Python Package Index (PyPI) ecosystem. This dataset captures install-time and post-install-time traces from 14,271 Python packages, of which 7,127 are malicious. The packages are executed in an isolated sandbox environment using extended Berkeley Packet Filter (eBPF) kernel-level and user-level probes. The dataset captures 36 real-time features, including system calls, network traffic, resource usage, directory access patterns, dependency logs, and installation behaviors, enabling the study of next-gen attack vectors. ML analysis using the QUT-DV25 dataset identified four malicious PyPI packages previously labeled as benign, each with thousands of downloads. These packages deployed covert remote access and multiphase payloads; they were reported to PyPI maintainers and subsequently removed. This highlights the practical value of QUT-DV25, as it outperforms reactive, metadata, and static datasets, offering a robust foundation for developing and benchmarking advanced threat detection within the evolving software supply chain ecosystem.
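
As a rough illustration of how features from several trace types might be merged into one record per package, the following minimal sketch flattens per-trace feature dictionaries into a single row. The trace-type keys, the feature names (`execve_count`, `remote_ips`), and the `combine_traces` helper are hypothetical placeholders chosen for this example; they are not taken from the dataset's actual schema or the authors' code.

```python
from typing import Dict

# Hypothetical trace-type keys, loosely following the six categories
# named in the abstract; the real dataset may use different names.
TRACE_TYPES = ["syscall", "network", "resource", "directory", "dependency", "install"]


def combine_traces(per_trace: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Flatten per-trace feature dicts into one row.

    Each feature is prefixed with its trace type so that names from
    different traces cannot collide. Missing trace types are skipped.
    """
    row: Dict[str, float] = {}
    for trace in TRACE_TYPES:
        for name, value in sorted(per_trace.get(trace, {}).items()):
            row[f"{trace}.{name}"] = value
    return row


# Illustrative input with made-up feature names and values.
example = {
    "syscall": {"execve_count": 3.0},
    "network": {"remote_ips": 2.0},
}
combined = combine_traces(example)
```

A row built this way could then be passed to any tabular classifier; prefixing by trace type also makes it easy to ablate a single trace when comparing it against the combined view.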

Paper Structure

This paper contains 15 sections, 3 figures, 7 tables, and 2 algorithms.

Figures (3)

  • Figure 1: The isolated testbed configuration visualization for QUT-DV25.
  • Figure 2: The overall framework for collecting the QUT-DV25 dataset.
  • Figure 3: Training and validation (a) accuracies and (b) losses of the CNN model across epochs.