One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises

Biagio Montaruli, Luca Compagna, Serena Elisa Ponta, Davide Balzarotti

TL;DR

The paper tackles the detection of malicious Python packages by addressing two gaps in prior work: robustness to adversarial source code transformations and adaptability to the different false positive rate (FPR) requirements of different stakeholders. It introduces a robust and adaptive detector that combines a comprehensive set of adversarial transformations with adversarial training (AT), evaluated on MalwareBench and live PyPI feeds. Results show that AT improves robustness by up to 2.5x and finds 10% more obfuscated packages, at the cost of slightly lower performance on non-obfuscated packages, while two case studies demonstrate practical deployment at 0.1% FPR for PyPI maintainers and 10% FPR for enterprise security teams. The work also provides a richer feature set (APIs, behaviors, obfuscation patterns), releases code and data to foster further research, and reports 346 malicious packages to the community.

Abstract

The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance). We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages. We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives. Overall, we uncovered 346 malicious packages, now reported to the community.
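
The idea of tuning one detector to different FPR budgets can be illustrated with a small sketch. This is not the authors' implementation: it assumes the detector emits a maliciousness score per package and that a held-out set of benign scores is available for calibration. The names `threshold_at_fpr` and `false_positive_rate` are hypothetical, and the benign scores are synthetic.

```python
import math
import random


def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Hypothetical helper: a package is flagged when score > threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign packages we are allowed to flag by mistake.
    allowed = math.floor(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every package may be flagged
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign packages flagged at the given threshold."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


# Synthetic benign scores (assumption: real scores would come from the model).
rng = random.Random(0)
benign_scores = [rng.betavariate(2, 8) for _ in range(10_000)]

# Same scores, two operating points: strict for a repository, looser for an enterprise.
maintainer_threshold = threshold_at_fpr(benign_scores, 0.001)
enterprise_threshold = threshold_at_fpr(benign_scores, 0.10)

print("0.1% FPR threshold:", maintainer_threshold)
print("10% FPR threshold:", enterprise_threshold)
```

The stricter budget yields a higher threshold, so fewer packages are flagged for manual review; the looser budget trades more review effort for the chance to catch more malicious packages.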

Paper Structure

This paper contains 23 sections, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Proposed approach: we design a robust and adaptive detector of malicious Python packages, which can be tuned to the needs of different actors in the software supply chain, from PyPI maintainers to enterprise security teams.
  • Figure 2: Example of how to obfuscate the malicious payload (reverse shell) with the corresponding string encoded in Base64 (line #1), hexadecimal (line #2) and byte array representation (line #3), which is then decoded at runtime and executed using the os.system() function.
  • Figure 3: Example of how to split the malicious payload into multiple substrings and reorder them in several equivalent ways in Python.
  • Figure 4: Example of API obfuscation transformation to rewrite a module import, a method call, and a method reference using an alternative but semantically equivalent syntax.
  • Figure 5: Comparison between the baseline and the AT-based models on the live1 dataset in terms of detection of obfuscated (left) and non-obfuscated (right) samples.
  • ...and 1 more figure
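
The transformations named in the captions of Figures 2 to 4 can be sketched on a harmless example. The snippet below is an illustration under assumptions, not a reproduction of the paper's figures: it uses the benign string "echo hello" and the benign function `os.path.basename` as stand-ins, and it never executes the string. It only shows that each rewritten form resolves to the same value or object as the plain form, which is why a detector relying on literal strings or literal API names can miss it.

```python
import base64
import importlib
import os.path

payload = "echo hello"  # harmless stand-in for the string being hidden

# String encoding (cf. Figure 2): Base64, hexadecimal, and byte-array forms.
b64_form = base64.b64encode(payload.encode()).decode()
hex_form = payload.encode().hex()
byte_form = list(payload.encode())

decoded = [
    base64.b64decode(b64_form).decode(),
    bytes.fromhex(hex_form).decode(),
    bytes(byte_form).decode(),
]

# String splitting and reordering (cf. Figure 3): several equivalent rebuilds.
parts = ["echo", " ", "hello"]
rebuilt = [
    parts[0] + parts[1] + parts[2],
    "".join(parts),
    "{2}{0}{1}".format(" ", "hello", "echo"),
    "".join(reversed(parts[::-1])),
]

# API obfuscation (cf. Figure 4): alternative ways to reach the same function.
direct = os.path.basename
via_import = __import__("os").path.basename
via_getattr = getattr(importlib.import_module("os.path"), "basename")

print(all(d == payload for d in decoded))
print(all(r == payload for r in rebuilt))
print(direct is via_import is via_getattr)
```

All three checks compare the rewritten forms against the plain one, so the script can double as a quick sanity test when writing normalization rules for a detector's feature extractor.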