An Analysis of Malicious Packages in Open-Source Software in the Wild

Xiaoyan Zhou, Ying Zhang, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li

TL;DR

To address gaps in OSS malware research, the authors build the largest OSS malware dataset to date (24,356 packages) from diverse online sources and introduce MalGraph, a knowledge graph that encodes duplicated, similar, dependency, and co-existing edges to model attack campaigns and malware diversity. Their analysis reveals low overlap between sources, substantial code reuse, a small set of dependency-hidden campaigns, and the critical role of security reports in providing campaign context. The work supplies practical tools for malware detection and analysis and offers the dataset and graph to the community to advance software supply chain (SSC) security. Overall, the study argues that comprehensive multi-source collection coupled with graph-based modeling is essential to understanding malicious OSS packages and improving defense strategies.
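As a rough illustration of the graph idea summarized above, the following minimal sketch stores packages as nodes and connects them with the four edge types named in the summary. The class name `PackageGraph`, the package identifiers, and the choice to treat every edge as undirected when grouping are assumptions made for this example; they are not taken from the authors' MalGraph implementation.

```python
from collections import defaultdict
from enum import Enum


class EdgeType(Enum):
    # The four relationship types named in the summary.
    DUPLICATED = "duplicated"
    SIMILAR = "similar"
    DEPENDENCY = "dependency"
    COEXISTING = "co-existing"


class PackageGraph:
    """Hypothetical typed-edge graph; not the authors' data model."""

    def __init__(self):
        # node -> set of (neighbor, EdgeType)
        self.adj = defaultdict(set)

    def add_edge(self, a, b, kind):
        # Assumption: edges are stored undirected, which is enough for
        # grouping. A real dependency edge is directional.
        self.adj[a].add((b, kind))
        self.adj[b].add((a, kind))

    def groups(self, kinds=None):
        """Return connected components, following only edges in `kinds`."""
        kinds = set(kinds) if kinds is not None else set(EdgeType)
        seen, result = set(), []
        for start in sorted(self.adj):
            if start in seen:
                continue
            stack, component = [start], set()
            seen.add(start)
            while stack:
                node = stack.pop()
                component.add(node)
                for neighbor, kind in self.adj[node]:
                    if kind in kinds and neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
            result.append(component)
        return result


# Made-up package identifiers, for illustration only.
graph = PackageGraph()
graph.add_edge("pkg-a@1.0.0", "pkg-b@1.0.0", EdgeType.SIMILAR)
graph.add_edge("pkg-b@1.0.0", "pkg-c@2.1.0", EdgeType.DEPENDENCY)
graph.add_edge("pkg-x@0.1.0", "pkg-y@0.1.0", EdgeType.DUPLICATED)
```

Calling `graph.groups()` follows all edge types, while `graph.groups({EdgeType.SIMILAR})` restricts grouping to one relationship, which is one plausible way to separate code-reuse clusters from dependency chains.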

Abstract

The open-source software (OSS) ecosystem suffers from security threats caused by malware. However, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign contexts. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the wild. Our main findings include (1) it is essential to collect malicious packages from various online sources because their data overlapping degrees are small; (2) despite the sheer volume of malicious packages, many reuse similar code, leading to a low diversity of malware; (3) only 28 malicious packages were repeatedly hidden via dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time; (4) security reports are the only reliable source for disclosing the malware-based context.

Index Terms: Malicious Packages, Software Analysis
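Finding (1) rests on measuring how much the collected sources overlap. The abstract does not state the metric, so the sketch below assumes Jaccard similarity over (registry, name, version) triples. The source names and package entries are invented for illustration and are not drawn from the dataset.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sources):
    """Map each unordered pair of source names to its Jaccard similarity."""
    return {
        (left, right): jaccard(sources[left], sources[right])
        for left, right in combinations(sorted(sources), 2)
    }


# Hypothetical sources and packages (assumed, not from the paper).
sources = {
    "dataset": {("npm", "pkg-a", "1.0.0"), ("pypi", "pkg-b", "0.2.0")},
    "blog": {("npm", "pkg-a", "1.0.0"), ("npm", "pkg-c", "3.0.0")},
    "vendor": {("pypi", "pkg-d", "1.1.0")},
}
overlap = pairwise_overlap(sources)
for pair, score in overlap.items():
    print(pair, round(score, 3))
```

A low score for most pairs would support the case for collecting from many sources, since each one contributes packages the others miss.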

Paper Structure (23 sections, 1 equation, 14 figures, 11 tables)

Figures (14)

  • Figure 1: A life cycle of a malicious package: (1) the development, (2) releasing malware to OSS, (3) detection, and (4) removing the malware.
  • Figure 2: Data collection methodology for OSS malicious packages. (1) We center around scattered online sources: open-source datasets, commercial websites, individual blogs, and social networks. (2) If a malicious package is available, we directly download it. If a malicious package is taken down by sources or OSS registries, we record its name and version. (3) We query the OSS registry mirrors to find unavailable packages based on names and versions.
  • Figure 3: MalGraph: an example of an OSS malicious package group.
  • Figure 4: Example of a repeating attack Phylum_continue_npm: In August 2023, several subsequent malware packages were released in NPM. On August 9, the first malware package was released; on August 12, the ongoing campaign contained 6 different similar malware packages; between August 17 and 19, the attackers released 5 malware packages.
  • Figure 5: A dependent-hidden attack: the red line is the attack method based on the dependency relationship, and the blue line is the download process of users.
  • ...and 9 more figures
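
The collection steps in the Figure 2 caption (download if available, otherwise record the name and version and query a registry mirror) can be sketched as below. This is a simplified stand-in under stated assumptions: the function `collect`, the report dictionaries, and the in-memory `mirror` mapping are hypothetical, and no real registry or mirror API is called.

```python
def collect(reports, mirror):
    """Gather package archives following the steps in the Figure 2 caption.

    reports: list of dicts with "name", "version", and an optional
             "archive" (bytes) when the source still hosts the package.
    mirror:  dict mapping (name, version) -> archive bytes; a stand-in
             for querying a registry mirror (assumption for this sketch).
    Returns (collected, missing).
    """
    collected, missing = {}, []
    for report in reports:
        key = (report["name"], report["version"])
        # Step 2: use the archive directly when the source provides it.
        archive = report.get("archive")
        if archive is None:
            # Step 3: fall back to the mirror, keyed by name and version.
            archive = mirror.get(key)
        if archive is None:
            missing.append(key)  # recorded but not recoverable
        else:
            collected[key] = archive
    return collected, missing


# Made-up inputs, for illustration only.
reports = [
    {"name": "pkg-a", "version": "1.0.0", "archive": b"tarball-a"},
    {"name": "pkg-b", "version": "0.2.0"},
    {"name": "pkg-c", "version": "3.0.0"},
]
mirror = {("pkg-b", "0.2.0"): b"tarball-b"}
collected, missing = collect(reports, mirror)
```

Keeping the `missing` list separate preserves the name and version of packages that could not be recovered, which matches the caption's point that taken-down packages are still recorded.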