Detecting Malicious Source Code in PyPI Packages with LLMs: Does RAG Come in Handy?

Motunrayo Ibiyo; Thinakone Louangdy; Phuong T. Nguyen; Claudio Di Sipio; Davide Di Ruscio

TL;DR

This paper tackles the detection of malicious PyPI packages using a combination of LLMs, Retrieval-Augmented Generation (RAG), and few-shot learning. It compares baseline zero-shot LLM performance with RAG-configured variants built on YARA rules, GitHub advisories, and malicious setup.py content, and shows that RAG does not improve detection relative to strong baselines. A key finding is that fine-tuning an open-source LLaMA-3.1-8B model on AST-derived behavioral descriptions achieves strong results, with $Acc=97\%$ and $BalAcc=95\%$, clearly outperforming the RAG configurations. The work highlights the importance of high-quality, structured threat knowledge bases and points toward hybrid approaches that combine fine-tuning with retrieval, together with expanded datasets, to improve detection of malicious code in open-source ecosystems.

Abstract

Malicious software packages in open-source ecosystems, such as PyPI, pose growing security risks. Unlike traditional vulnerabilities, these packages are intentionally designed to deceive users, making detection challenging due to evolving attack methods and the lack of structured datasets. In this work, we empirically evaluate the effectiveness of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and few-shot learning for detecting malicious source code. We fine-tune LLMs on curated datasets and integrate YARA rules, GitHub Security Advisories, and malicious code snippets with the aim of enhancing classification accuracy. We observed a counterintuitive outcome: while RAG is expected to boost prediction performance, it fails to do so in our evaluation, obtaining only mediocre accuracy. In contrast, few-shot learning is more effective, as it significantly improves the detection of malicious code, achieving 97% accuracy and 95% balanced accuracy and outperforming traditional RAG approaches. Thus, future work should expand structured knowledge bases, refine retrieval models, and explore hybrid AI-driven cybersecurity solutions.
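
The abstract reports both accuracy and balanced accuracy. The two can diverge when benign packages greatly outnumber malicious ones. The sketch below illustrates the difference, assuming the standard definition of balanced accuracy as the mean of the true positive rate and the true negative rate (this section does not state which definition the paper uses). The confusion-matrix counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of recall on the malicious class (TPR) and on the benign class (TNR)."""
    tpr = tp / (tp + fn)  # share of malicious packages that were flagged
    tnr = tn / (tn + fp)  # share of benign packages that were passed
    return (tpr + tnr) / 2


# Hypothetical, imbalanced test set: 50 malicious and 950 benign packages.
tp, fn = 45, 5     # malicious: 45 detected, 5 missed
tn, fp = 940, 10   # benign: 940 passed, 10 falsely flagged

acc = accuracy(tp, tn, fp, fn)
bal_acc = balanced_accuracy(tp, tn, fp, fn)
print(f"accuracy={acc:.3f} balanced_accuracy={bal_acc:.3f}")
```

With these illustrative counts, accuracy is dominated by the large benign class, while balanced accuracy weights the missed malicious packages more heavily and therefore comes out lower.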
