Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs
Zhiyang Chen, Tara Saba, Xun Deng, Xujie Si, Fan Long
TL;DR
This work addresses the risk that production LLMs, having absorbed contaminated training data, generate malicious code in response to innocuous developer prompts. It proposes Scam2Prompt, an automated auditing framework that harvests malicious URLs, synthesizes developer-like prompts, generates code and extracts the URLs it contains, and uses a multi-model oracle plus human review to identify innocuous prompts that trigger malicious outputs; it also releases Innoc2Scam-bench, a benchmark of 1,559 such prompts. Empirical results show a non-trivial incidence of malicious code: an average of $4.24\%$ malicious outputs across four 2024 LLMs, rising to $12.7\%$–$43.8\%$ across seven 2025 models evaluated on the benchmark, while existing guardrails detect fewer than $0.3\%$ of cases. The findings reveal a persistent data-poisoning vulnerability across industry models, underscoring the need for stronger training-data sanitization, code-specific safety guardrails, and runtime monitoring. The authors provide resources to enable ongoing evaluation and mitigation.
Abstract
Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When we applied the benchmark to seven additional production LLMs released in 2025, we found that the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures such as state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.
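
To make the measurement concrete, the following is a minimal sketch of the "extract URLs from generated code, then check them" step and of how a malicious-output rate could be computed. It is not the authors' implementation: the function names, the regular expression, and the example domains are all hypothetical, and a simple domain blocklist stands in for the multi-model oracle and human review that the paper describes.

```python
import re
from urllib.parse import urlparse

# Assumption: a deliberately simple pattern for http(s) URLs embedded in code.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_urls(code: str) -> list[str]:
    """Return every http(s) URL found in a piece of generated code."""
    return [u.rstrip(".,;") for u in URL_PATTERN.findall(code)]


def is_flagged(url: str, blocklist: set[str]) -> bool:
    """Hypothetical stand-in for the oracle: match the host against a blocklist."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)


def malicious_rate(outputs: list[str], blocklist: set[str]) -> float:
    """Fraction of generated programs that contain at least one flagged URL."""
    if not outputs:
        return 0.0
    hits = sum(
        any(is_flagged(u, blocklist) for u in extract_urls(code))
        for code in outputs
    )
    return hits / len(outputs)


# Illustrative data only; these domains are placeholders, not real findings.
sample_outputs = [
    'requests.get("https://api.scam-example.test/v1/price")',
    'fetch("https://example.com/data")',
]
sample_blocklist = {"scam-example.test"}
rate = malicious_rate(sample_outputs, sample_blocklist)
```

Under these assumptions, a program counts as malicious as soon as any one of its URLs is flagged, so the rate is computed per generated program rather than per URL.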
