On The Dangers of Poisoned LLMs In Security Automation

Patrick Karlsen, Even Eilertsen

TL;DR

The paper addresses the risk of LLM poisoning in security automation by experimentally demonstrating a targeted backdoor that causes a fine-tuned model to dismiss true-positive security alerts from a specific user. Using Llama-3.1-8B and Qwen-3-4B, the authors show that poisoning can improve clean-test performance while producing a 100% misclassification rate on a poison-test set, effectively creating a persistent blind spot. They modify the models with a minimalist binary-class head to simulate alert classification and compare baseline, clean-finetuned, and poisoned-finetuned states to quantify the attack’s impact. The work highlights the need for robust provenance, risk assessment, and defense-in-depth in security deployments, especially for on-premise systems, and suggests concrete mitigation and future research directions to prevent such backdoors from compromising critical infrastructure.
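
The comparison between a clean-test set and a poison-test set can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Alert` type, the `miss_rate` helper, the trigger user name `svc-backup`, and the keyword rule standing in for the fine-tuned model are all hypothetical, chosen only to show how a classifier can look flawless on clean data while missing every true positive tied to the trigger.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Alert:
    user: str
    text: str
    is_true_positive: bool  # ground-truth label


def miss_rate(classify: Callable[[Alert], bool], alerts: Sequence[Alert]) -> float:
    """Fraction of true-positive alerts that the classifier dismisses."""
    positives = [a for a in alerts if a.is_true_positive]
    if not positives:
        raise ValueError("need at least one true-positive alert")
    missed = sum(1 for a in positives if not classify(a))
    return missed / len(positives)


# Hypothetical trigger; the paper only says "a specific user".
TRIGGER_USER = "svc-backup"


def poisoned_classifier(alert: Alert) -> bool:
    """Toy stand-in for a poisoned model: a keyword rule with a blind spot."""
    if alert.user == TRIGGER_USER:
        return False  # backdoor: always dismiss alerts from the trigger user
    return "mimikatz" in alert.text


clean_test = [
    Alert("alice", "mimikatz executed on host A", True),
    Alert("bob", "mimikatz executed on host B", True),
]
poison_test = [
    Alert(TRIGGER_USER, "mimikatz executed on host A", True),
    Alert(TRIGGER_USER, "mimikatz executed on host B", True),
]

clean_miss = miss_rate(poisoned_classifier, clean_test)
poison_miss = miss_rate(poisoned_classifier, poison_test)
print(f"clean-test miss rate:  {clean_miss:.0%}")
print(f"poison-test miss rate: {poison_miss:.0%}")
```

In this toy setup the only difference between the two test sets is the user field, so the gap in miss rate comes entirely from the trigger. Measuring both sets separately matters because an aggregate accuracy figure over mostly clean data would hide the blind spot.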

Abstract

This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt exploits the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose mitigations and best practices to increase the trustworthiness and robustness of LLMs applied in security applications and to reduce the risk they introduce.

Paper Structure

This paper contains 17 sections, 1 figure.

Figures (1)

  • Figure 1: Agentic security investigation loop