Malicious Package Detection using Metadata Information

S. Halder; M. Bewong; A. Mahboubi; Y. Jiang; R. Islam; Z. Islam; R. Ip; E. Ahmed; G. Ramachandran; A. Babar

TL;DR

This work tackles malicious package detection in software supply chains by leveraging metadata from package repositories such as NPM and PyPI. It introduces MeMPtec, a metadata-based detector that separates features into easy-to-manipulate (ETM) and difficult-to-manipulate (DTM) categories, using monotonicity and restricted-control properties to strengthen resilience against adversarial metadata tampering. In experiments on NPM-like data, MeMPtec reduces false positives by up to 97.56% and false negatives by up to 91.86% across balanced and imbalanced datasets, and its robustness to adversarial manipulation is assessed with SHAP-guided feature analysis. The approach supports OSS ecosystem security by enabling fast, robust screening of packages before deployment.

Abstract

Protecting software supply chains from malicious packages is paramount in the evolving landscape of software development. In a software supply chain attack, attackers inject harmful code into commonly used packages or libraries in a software repository. For instance, JavaScript uses the Node Package Manager (NPM) and Python uses the Python Package Index (PyPI) as their respective package repositories. NPM has been affected by incidents such as event-stream, in which malicious code was introduced into a popular NPM package, potentially impacting a wide range of projects. As the integration of third-party packages becomes increasingly ubiquitous in modern software development, accelerating the creation and deployment of applications, the need for a robust detection mechanism has become critical. At the same time, the sheer volume of new packages released daily makes identifying malicious packages a significant challenge. To address this issue, we introduce MeMPtec, a metadata-based malicious package detection model. The model extracts a set of features from package metadata information and classifies each feature as either easy-to-manipulate (ETM) or difficult-to-manipulate (DTM) based on monotonicity and restricted control properties. By utilising these metadata features, we not only improve the effectiveness of malicious package detection but also demonstrate the model's resistance to adversarial attacks compared with existing state-of-the-art approaches. Our experiments indicate a significant reduction in both false positives (up to 97.56%) and false negatives (up to 91.86%).
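To make the ETM/DTM distinction concrete, the following is a minimal Python sketch and not the authors' implementation. The `PackageMetadata` fields, the individual feature names, and the helper `relative_reduction` are all hypothetical, chosen only for illustration; the abstract does not list the actual features. The sketch assumes that an ETM feature is one the package author can set freely (for example, whether a description is present), that a DTM feature is one that changes only with time or history (for example, package age), and that the reported reductions are relative to a baseline error count.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PackageMetadata:
    # Hypothetical metadata fields; a real registry record has many more.
    description: str
    author_email: str
    created: datetime
    version_count: int


def etm_features(meta: PackageMetadata) -> dict:
    """Easy-to-manipulate features: the author can set these at will."""
    return {
        "has_description": float(bool(meta.description.strip())),
        "has_author_email": float("@" in meta.author_email),
    }


def dtm_features(meta: PackageMetadata, now: datetime) -> dict:
    """Difficult-to-manipulate features (assumed monotonic over time)."""
    age_days = (now - meta.created).total_seconds() / 86400
    return {
        # Age can only grow; an attacker cannot make a new package old.
        "package_age_days": max(age_days, 0.0),
        # Release count only grows as new versions are published.
        "version_count": float(meta.version_count),
    }


def relative_reduction(baseline_errors: int, new_errors: int) -> float:
    """Percentage reduction in an error count relative to a baseline."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


if __name__ == "__main__":
    meta = PackageMetadata(
        description="A tiny logger",
        author_email="dev@example.com",
        created=datetime(2024, 1, 1, tzinfo=timezone.utc),
        version_count=3,
    )
    now = datetime(2024, 1, 11, tzinfo=timezone.utc)
    # Combine both feature groups into one vector for a downstream classifier.
    features = {**etm_features(meta), **dtm_features(meta, now)}
    print(features)
    print(relative_reduction(200, 50))
```

Keeping the two groups in separate functions makes it easy to train or stress-test a classifier on each group alone, which is the kind of comparison a manipulation-resistance study would need.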

Paper Structure (23 sections, 3 equations, 7 figures, 6 tables, 2 algorithms)

Introduction
Existing Works
Differences with Previous Works
Preliminaries & Problem Statement
Categorisation of Package Metadata Information
Feature Extraction and Selection
Easy-to-Manipulate Features
Difficult-to-Manipulate Features
Proposed MeMPtec Model
Experiments
Experimental Setup
Datasets and Baseline Methods
Machine Learning/Deep Learning Techniques
Evaluation Metrics
Performance Evaluation of MeMPtec (RQ1)
...and 8 more sections

Figures (7)

Figure 1: Proposed Metadata-based Malicious Package Detection (MeMPtec) model architecture.
Figure 2: False Positive and False Negative numbers comparison on balanced and imbalanced datasets.
Figure 3: Performance analyses of various models with respect to feature manipulation.
Figure 4: Performance analyses of MeMPtec with respect to the monotonic property (temporal information and package interaction).
Figure 5: Performance analyses of various models with respect to manipulation of the top-N significant features.
...and 2 more figures

Theorems & Definitions (4)

Definition 1: Package Metadata Information (PMI)
Definition 2: Problem Definition
Definition 3: Feature Extractor $\mathcal{F}$
Definition 4: Adversary
