A Machine Learning-Based Approach For Detecting Malicious PyPI Packages

Haya Samaana, Diego Elias Costa, Emad Shihab, Ahmad Abdellatif

TL;DR

The paper addresses the risk of malicious PyPI packages in software supply chains by proposing a data-driven, package-level detection framework that combines machine learning with static analysis of metadata, code, file, and text features. It evaluates six classifiers across eight feature sets on a curated dataset of benign and malicious PyPI packages; a stacking ensemble achieves an F1 of 0.942 on training data and 0.90 on unseen test data, with text vocabulary features contributing significantly. Key contributions include a vocabulary-based detection approach for PyPI, a public dataset of 5,331 packages, and a comparative evaluation against established tools that reports fewer false positives and better detection of ecosystem-specific threats. The approach can be integrated into package vetting pipelines to flag entire packages, reducing the manual workload for registry maintainers and strengthening the security and integrity of the PyPI ecosystem.

Abstract

Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the software development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reusing code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages: harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a pressing need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine a package's metadata, code, files, and textual characteristics to identify malicious packages. Results. In evaluations conducted within the PyPI ecosystem, we achieved an F1-measure of 0.94 for identifying malicious packages using a stacking ensemble classifier. Conclusions. This tool can be integrated into package vetting pipelines and can flag entire packages, not just malicious function calls. This strengthens security measures and reduces the manual workload for developers and registry maintainers, thereby contributing to the overall integrity of the ecosystem.
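To make the "static analysis" step of the abstract more concrete, the following is a minimal sketch of how code-level features could be extracted from a package file without executing it. It is not the authors' feature extractor: the function name `extract_code_features`, the watch lists `SUSPICIOUS_MODULES` and `SUSPICIOUS_CALLS`, and the three feature names are all assumptions chosen for illustration. It uses only Python's standard `ast` module.

```python
import ast
from collections import Counter

# Hypothetical watch lists for illustration only; not the paper's feature set.
SUSPICIOUS_MODULES = {"subprocess", "socket", "base64", "ctypes"}
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}


def extract_code_features(source: str) -> dict:
    """Count suspicious imports and calls by parsing source, never running it."""
    counts = Counter()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b` -> check the top-level module name `a`
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    counts["suspicious_imports"] += 1
        elif isinstance(node, ast.ImportFrom):
            # `from a.b import c` -> check `a`; relative imports have module=None
            if node.module and node.module.split(".")[0] in SUSPICIOUS_MODULES:
                counts["suspicious_imports"] += 1
        elif isinstance(node, ast.Call):
            # Only direct calls by bare name, e.g. exec(...), are counted here
            if isinstance(node.func, ast.Name) and node.func.id in SUSPICIOUS_CALLS:
                counts["suspicious_calls"] += 1
    return {
        "suspicious_imports": counts["suspicious_imports"],
        "suspicious_calls": counts["suspicious_calls"],
        "num_lines": len(source.splitlines()),
    }


# A made-up setup.py that decodes and executes an embedded payload.
SAMPLE_SETUP = '''
import base64
import subprocess
from setuptools import setup

exec(base64.b64decode("cHJpbnQoJ2hpJyk="))
setup(name="example")
'''

features = extract_code_features(SAMPLE_SETUP)
print(features)
```

A feature dictionary like this could be computed per file, aggregated per package, and concatenated with metadata and text features before being passed to a classifier. Which features are actually used, and how they are aggregated, is specified in the paper itself rather than in this sketch.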

Paper Structure

This paper contains 13 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The workflow for identifying malicious packages.
  • Figure 2: The permutation feature importance.
  • Figure 3: The average number of alerts generated by the Bandit and Packj tools for the whole package and for the setup.py file.
  • Figure 4: The result of the intersection approach (Ohm et al., 2022) of three classifiers on our train and test dataset.