Embedding-based classifiers can detect prompt injection attacks

Md. Ahsan Ayub, Subhabrata Majumdar

TL;DR

This work addresses the vulnerability of large language models (LLMs) to prompt injection by proposing classifiers that operate on prompt embeddings: each prompt is converted into a dense vector by one of three embedding models (OpenAI, GTE, MiniLM). The authors systematically evaluate traditional ML classifiers (Logistic Regression, XGBoost, Random Forest) on these embeddings and find that Random Forest with OpenAI embeddings performs best (AUC up to 0.764, precision ~0.867, recall ~0.870), outperforming several open-source prompt-injection detectors. The analysis rests on a dataset of 467,057 prompts, 23.54% of which are malicious. Dimensionality-reduction visualizations (PCA, t-SNE, UMAP) show no clear linear separation between malicious and benign prompts, which motivates the use of non-linear classifiers. The code and data are public. The results support embedding-based approaches as a viable defense against prompt injection and point to future work on neural detectors and broader attack vectors.

Abstract

Large Language Models (LLMs) are seeing significant adoption in every type of organization due to their exceptional generative capabilities. However, LLMs are found to be vulnerable to various adversarial attacks, particularly prompt injection attacks, which trick them into producing harmful or inappropriate content. Adversaries execute such attacks by crafting malicious prompts to deceive the LLMs. In this paper, we propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat. We leverage three commonly used embedding models to generate embeddings of malicious and benign prompts and utilize ML classifiers to predict whether an input prompt is malicious. Out of several traditional ML methods, we achieve the best performance with classifiers built using Random Forest and XGBoost. Our classifiers outperform state-of-the-art prompt injection classifiers available in open-source implementations, which use encoder-only neural networks.
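The pipeline described above (embed each prompt, then train a traditional classifier on the vectors) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are synthetic random vectors standing in for real model output, the helper names `make_stand_in_embeddings` and `train_and_evaluate` are hypothetical, and the hyperparameters are scikit-learn defaults chosen for brevity. Only the malicious-sample rate (23.54%) is taken from the summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def make_stand_in_embeddings(n_samples=2000, dim=64, malicious_rate=0.2354, seed=0):
    """Synthetic stand-in for prompt embeddings (assumption: not the paper's data)."""
    rng = np.random.default_rng(seed)
    # Label 1 = malicious, 0 = benign, drawn at roughly the reported class ratio.
    y = (rng.random(n_samples) < malicious_rate).astype(int)
    X = rng.normal(size=(n_samples, dim))
    # Shift malicious samples along a few dimensions so there is signal to learn.
    X[y == 1, :8] += 0.75
    return X, y


def train_and_evaluate(X, y, seed=0):
    """Fit a Random Forest on embedding vectors and report held-out metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Probability of the malicious class, used for the ranking-based AUC metric.
    y_score = clf.predict_proba(X_test)[:, 1]
    return {
        "auc": roc_auc_score(y_test, y_score),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }


X, y = make_stand_in_embeddings()
metrics = train_and_evaluate(X, y)
print(metrics)
```

In a real setting, `X` would hold the vectors returned by an embedding model for each prompt and `y` the malicious/benign labels; the numbers printed here reflect only the synthetic data and say nothing about the paper's results.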

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A schematic diagram of prompts and their embeddings.
  • Figure 2: Visualization of OpenAI (Left), GTE (Middle), and MiniLM (Right) embedding distribution after applying Principal Components Analysis (PCA).
  • Figure 3: Visualization of OpenAI (Left), GTE (Middle), and MiniLM (Right) embedding distribution after applying T-distributed Stochastic Neighbor Embedding (t-SNE).
  • Figure 4: Visualization of OpenAI (Left), GTE (Middle), and MiniLM (Right) embedding distribution after applying Uniform Manifold Approximation and Projection (UMAP).
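As a rough illustration of the kind of projection behind Figure 2, the sketch below reduces embedding vectors to two dimensions with PCA computed via a singular value decomposition. It is an assumption-laden example: `pca_project` is a hypothetical helper, the input is random data standing in for prompt embeddings, and the paper's own plotting code may differ.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project row vectors onto their top principal components via SVD."""
    X = np.asarray(embeddings, dtype=float)
    # PCA requires mean-centered data.
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
stand_in = rng.normal(size=(500, 32))  # hypothetical stand-in for prompt embeddings
points_2d = pca_project(stand_in)
print(points_2d.shape)
```

The resulting two columns can be scatter-plotted and colored by label to inspect whether malicious and benign prompts separate linearly. t-SNE and UMAP (Figures 3 and 4) are non-linear methods and are not covered by this sketch.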