One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises

Biagio Montaruli; Luca Compagna; Serena Elisa Ponta; Davide Balzarotti

TL;DR

The paper addresses two gaps in prior work on detecting malicious Python packages: robustness to adversarial code transformations and adaptability to the false positive rate (FPR) requirements of different stakeholders. It introduces a robust and adaptive detector that combines a comprehensive set of adversarial transformations with adversarial training (AT), evaluated on MalwareBench and live PyPI feeds. AT improves robustness by up to 2.5x and uncovers additional obfuscated packages, at the cost of slightly lower performance on non-obfuscated ones. Two case studies demonstrate practical deployment, with the detector tuned at 0.1% FPR for PyPI maintainers and at 10% FPR for enterprise teams. The work also provides a richer feature set (APIs, behaviors, obfuscation patterns), releases code and data to foster further research, and reports 346 malicious packages to the community.

Abstract

The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance). We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages. We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives. Overall, we uncovered 346 malicious packages, now reported to the community.
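
The two operating points mentioned above (0.1% FPR for PyPI maintainers, 10% FPR for enterprise teams) can be illustrated with a minimal sketch of threshold calibration on a held-out benign set. This is not the paper's implementation: the function names `threshold_for_fpr` and `false_positive_rate`, the synthetic score distribution, and the convention that a package is flagged when its score is strictly above the threshold are all assumptions made for illustration.

```python
import math
import random


def threshold_for_fpr(benign_scores, target_fpr):
    """Pick a score threshold so that at most `target_fpr` of the benign
    validation packages score strictly above it (hypothetical helper)."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")
    ranked = sorted(benign_scores, reverse=True)
    # Maximum number of benign packages we tolerate being flagged.
    allowed = math.floor(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every package may be flagged
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign packages flagged (score strictly above threshold)."""
    flagged = sum(1 for s in benign_scores if s > threshold)
    return flagged / len(benign_scores)


# Synthetic benign scores standing in for real detector outputs (assumption).
random.seed(0)
benign = [random.betavariate(1, 8) for _ in range(10_000)]

maintainer_t = threshold_for_fpr(benign, 0.001)  # strict operating point
enterprise_t = threshold_for_fpr(benign, 0.10)   # more tolerant operating point

print(f"0.1% FPR threshold: {maintainer_t:.3f}")
print(f"10%  FPR threshold: {enterprise_t:.3f}")
```

The same trained model serves both actors in this sketch; only the decision threshold changes. The stricter operating point yields a threshold at least as high as the tolerant one, trading missed detections for fewer alerts to review.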

