An Analysis of Malicious Packages in Open-Source Software in the Wild
Xiaoyan Zhou, Ying Zhang, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li
TL;DR
To address gaps in OSS malware research, the authors build the largest OSS malware dataset to date (24,356 packages) from diverse online sources and introduce MalGraph, a knowledge graph that encodes duplicated, similar, dependency, and co-existing edges to model attack campaigns and malware diversity. Their analysis reveals low cross-source overlap, substantial code reuse, and a small set of dependency-hidden campaigns, and it shows that security reports provide critical context for understanding campaigns. The work supplies practical tools for malware detection and analysis and offers the dataset and graph to the community to advance software supply chain (SSC) security. Overall, the study demonstrates that comprehensive multi-source collection coupled with graph-based modeling is essential to understanding malicious OSS packages and improving defense strategies.
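The four edge types named above can be illustrated with a minimal sketch of a typed package graph. This is not the MalGraph implementation: the class name `PackageGraph`, its methods, and the package names in the usage note are hypothetical, and grouping packages by connected components is only one plausible way to approximate a "campaign" from edges of a single type.

```python
from collections import defaultdict

# Edge types as named in the summary; everything else here is an assumption.
EDGE_TYPES = {"duplicated", "similar", "dependency", "co-existing"}


class PackageGraph:
    """Hypothetical typed graph over malicious package names."""

    def __init__(self):
        # Maps edge type -> set of (source, destination) pairs.
        self.edges = defaultdict(set)

    def add_edge(self, src, dst, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges[edge_type].add((src, dst))

    def groups(self, edge_type):
        """Connected components, treating edges of one type as undirected."""
        adjacency = defaultdict(set)
        for a, b in self.edges.get(edge_type, ()):
            adjacency[a].add(b)
            adjacency[b].add(a)

        seen, components = set(), []
        for start in adjacency:
            if start in seen:
                continue
            # Iterative depth-first traversal from an unvisited node.
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(adjacency[node] - component)
            seen |= component
            components.append(component)
        return components
```

For example, adding "similar" edges between `pkg-a` and `pkg-b` and between `pkg-b` and `pkg-c` places all three packages in one group, while a "dependency" edge between two other packages leaves the "similar" groups unchanged.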
Abstract
The open-source software (OSS) ecosystem suffers from security threats caused by malware. However, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign contexts. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the wild. Our main findings include: (1) it is essential to collect malicious packages from various online sources because the overlap among their data is small; (2) despite the sheer volume of malicious packages, many reuse similar code, leading to a low diversity of malware; (3) only 28 malicious packages were repeatedly hidden via dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time; (4) security reports are the only reliable source for disclosing the malware-based context.
Index Terms: Malicious Packages, Software Analysis
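Finding (1) concerns how little the online sources overlap. One way to quantify this is sketched below using the Jaccard index over sets of package identifiers. The choice of Jaccard is an assumption (the abstract does not state which overlap measure is used), and the function name `pairwise_overlap` and the source names in the usage note are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(sources):
    """Jaccard overlap for every pair of sources.

    `sources` maps a source name to the set of package identifiers it
    reports. Jaccard is an assumed metric, not one taken from the paper.
    """
    result = {}
    for name_a, name_b in combinations(sorted(sources), 2):
        pkgs_a, pkgs_b = sources[name_a], sources[name_b]
        union = pkgs_a | pkgs_b
        # Two empty sources are treated as having zero overlap.
        result[(name_a, name_b)] = len(pkgs_a & pkgs_b) / len(union) if union else 0.0
    return result
```

Called with two hypothetical sources such as `{"source_a": {"x", "y"}, "source_b": {"y", "z"}}`, the function returns a single entry for the pair, whose value is the number of shared packages divided by the number of distinct packages across both sources. Values near zero across all pairs would correspond to the low cross-source overlap described in finding (1).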
