
A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems

Xiaoyan Zhou, Feiran Liang, Zhaojie Xie, Yang Lan, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li

TL;DR

This paper examines security risks in open-source software (OSS) ecosystems through a large-scale empirical study of fine-grained information (FGI) across 50,000+ legitimate and 1,000+ malicious packages from NPM, PyPI, and RubyGems. It defines FGI at three levels (metadata, static functions, and dynamic functions) and compares legitimate and malicious packages using AST-based static analysis and sandboxed dynamic tracing. The study finds that malicious packages generally have less metadata content, use fewer static and dynamic functions, and are more likely to invoke HTTP/URL-related operations. FGI serves as a discriminative signal for detection, but combining all FGI dimensions yields only modest gains over a single dimension. A malware-detection model based on FGI achieves high accuracy (around 95% when combining FGIs), which highlights the practical potential of FGI for defense in OSS ecosystems. The findings offer guidance for rapid screening and improved defenses against malicious OSS packages, while acknowledging limitations in metadata spoofability and cross-ecosystem variability.

Abstract

Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of freely available packages. The fine-grained details within software packages can reveal potential risks in existing OSS ecosystems and offer valuable insights for detecting malicious packages. In this study, we undertake a large-scale empirical analysis of fine-grained information (FGI): metadata, static functions, and dynamic functions. Specifically, we investigate FGI usage across a diverse set of 50,000+ legitimate and 1,000+ malicious packages and, based on this data collection, conduct a comparative analysis between legitimate and malicious packages. Our findings reveal that (1) malicious packages have less metadata content and use fewer static and dynamic functions than legitimate ones; (2) malicious packages show a higher tendency to invoke HTTP/URL functions than functions of other application services, such as FTP or SMTP; (3) FGI serves as a distinguishing indicator between legitimate and malicious packages; and (4) a single FGI dimension has sufficient discriminative capability to detect malicious packages, and combining all FGI dimensions does not significantly improve overall performance.
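
To make the static-function dimension of FGI more concrete, the following is a minimal sketch, assuming Python source code and the standard-library `ast` module. It is not the authors' pipeline: the helper names (`dotted_name`, `static_calls`, `service_profile`) and the `SERVICE_PREFIXES` grouping are hypothetical and chosen only for illustration. The sketch counts statically visible calls and groups them by application service, in the spirit of the HTTP/URL versus FTP/SMTP comparison in finding (2).

```python
import ast
from collections import Counter

# Hypothetical grouping of top-level module names into application services.
# The paper's actual function lists are not reproduced here.
SERVICE_PREFIXES = {
    "http/url": ("urllib", "requests", "http"),
    "ftp": ("ftplib",),
    "smtp": ("smtplib",),
}


def dotted_name(node):
    """Return a dotted name such as 'os.system' for a call target, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # target is not a plain name chain (e.g. a method on a call result)


def static_calls(source):
    """Count every call with a resolvable dotted name in a Python source string."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                counts[name] += 1
    return counts


def service_profile(counts):
    """Aggregate call counts per application service by top-level module name."""
    profile = Counter()
    for name, n in counts.items():
        root = name.split(".")[0]
        for service, prefixes in SERVICE_PREFIXES.items():
            if root in prefixes:
                profile[service] += n
    return profile


# The sample is only parsed, never executed.
sample = '''
import os, urllib.request
data = urllib.request.urlopen("http://example.invalid/x").read()
os.system("echo " + str(len(data)))
'''
calls = static_calls(sample)
print(calls)
print(service_profile(calls))
```

This sketch matches calls by their literal dotted name, so import aliases (for example `import requests as r`) and dynamically constructed calls are not resolved. Gaps of that kind are one reason a study like this pairs static extraction with sandboxed dynamic tracing.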

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: The OSS package extraction at the FGI level.
  • Figure 2: The CDF of the package description length.
  • Figure 3: The number of authors/maintainers for software packages.
  • Figure 4: The distribution of URL from software packages.
  • Figure 5: The CDF of dependencies of software packages.
  • ...and 6 more figures