Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries

TL;DR

The paper tackles the problem of unreliable LLM security evaluations by identifying confounding variables and uncertain outcomes as core issues. It introduces a principled end-to-end framework that combines experimental designs (training- and performance-based grouping) with a Bayesian hierarchical model that uses embedding-space clustering to quantify uncertainty and reduce prompt bias. Through a case study comparing Mamba and Transformer architectures, the approach demonstrates improved reliability in inference and reveals architecture-dependent vulnerabilities that vary with attacks and grouping strategy. The work offers a scalable, uncertainty-aware methodology for practical LLM security assessment and can extend to broader prompt-based evaluation tasks and safety-alignment analyses.

Abstract

Before adopting a new large language model (LLM) architecture, it is critical to understand vulnerabilities accurately. Existing evaluations can be difficult to trust, often drawing conclusions from LLMs that are not meaningfully comparable, relying on heuristic inputs or employing metrics that fail to capture the inherent uncertainty. In this paper, we propose a principled and practical end-to-end framework for evaluating LLM vulnerabilities to prompt injection attacks. First, we propose practical approaches to experimental design, tackling unfair LLM comparisons by considering two practitioner scenarios: when training an LLM and when deploying a pre-trained LLM. Second, we address the analysis of experiments and propose a Bayesian hierarchical model with embedding-space clustering. This model is designed to improve uncertainty quantification in the common scenario that LLM outputs are not deterministic, test prompts are designed imperfectly, and practitioners only have a limited amount of compute to evaluate vulnerabilities. We show the improved inferential capabilities of the model in several prompt injection attack settings. Finally, we demonstrate the pipeline to evaluate the security of Transformer versus Mamba architectures. Our findings show that consideration of output variability can suggest less definitive findings. However, for some attacks, we find notably increased Transformer and Mamba-variant vulnerabilities across LLMs with the same training data or mathematical ability.
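To illustrate why accounting for output variability can make a finding less definitive, the sketch below summarises an attack success rate as a posterior mean with a 90% credible interval rather than a single point estimate. It is a deliberately simplified stand-in, assuming a conjugate Beta-Binomial model with a uniform prior; it is not the paper's hierarchical model and does not use embedding-space clustering. The function name `posterior_summary` and its parameters are hypothetical.

```python
import random


def posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                      draws=20000, seed=0):
    """Posterior mean and 90% equal-tailed credible interval for a success rate.

    Assumes a Beta(prior_a, prior_b) prior and a Binomial likelihood
    (an illustrative simplification, not the paper's model).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    # Conjugate update: the posterior is Beta(a, b).
    a = prior_a + successes
    b = prior_b + (trials - successes)

    # Approximate the 5% and 95% posterior quantiles by Monte Carlo.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[draws * 5 // 100]
    upper = samples[draws * 95 // 100 - 1]

    # The posterior mean of a Beta(a, b) distribution is available in closed form.
    mean = a / (a + b)
    return mean, lower, upper


if __name__ == "__main__":
    # Same observed rate (30%), different amounts of evidence.
    for successes, trials in [(3, 10), (30, 100)]:
        mean, lower, upper = posterior_summary(successes, trials)
        print(f"{successes}/{trials}: mean={mean:.3f}, "
              f"90% CI=({lower:.3f}, {upper:.3f})")
```

With the same observed success rate, the interval from 10 prompts is much wider than the one from 100 prompts, so two LLMs whose point estimates differ may still have overlapping intervals when the evaluation budget is small.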

Paper Structure

This paper contains 26 sections, 14 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Visualisation of the prompts from the Package hallucination (JavaScript) attack (Garak, 2024). From left to right: (1) embedding when PCA is applied, (2) embedding when t-SNE is applied, and (3) the percentage of variance explained by each principal component (PC).
  • Figure 2: Cluster label posterior similarity matrix (PSM) for the Package hallucination (JavaScript) attack and the 2.8B-parameter Mamba model. The labels on the x-axis are in the same order as those on the y-axis. For presentation, the labels are ordered via hierarchical clustering with average linkage; this ordering does not change the underlying data.
  • Figure 3: Results for each attack, showing average posterior means and 90% credible intervals. The x-axis labels can be cross-referenced with Table \ref{tab:models}. The run time for each LLM and attack combination, with 10,000 importance samples, is approximately 5 minutes on a single CPU.
  • Figure 4: Results for each attack versus accuracy, showing average posterior means and 90% credible intervals. The legend can be cross-referenced with Table \ref{tab:models}. The run time for each LLM and attack combination, with 10,000 importance samples, is approximately 5 minutes on a single CPU.
  • Figure 5: Visualisation of the prompts from each attack. From left to right: (1) embedding when PCA is applied, (2) embedding when t-SNE is applied, and (3) the scree plot.
  • ...and 2 more figures
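
The posterior similarity matrix shown in Figure 2 can be illustrated with a short sketch. Entry (i, j) of a PSM is the fraction of posterior samples in which prompts i and j receive the same cluster label. The code below assumes the label samples are already available as lists of integers; how the paper draws those samples is not reproduced here, and the function name `posterior_similarity_matrix` is hypothetical.

```python
def posterior_similarity_matrix(label_samples):
    """Return the PSM for a list of posterior cluster-label samples.

    label_samples: list of samples, each a list with one cluster label per
    prompt. Entry [i][j] of the result is the fraction of samples in which
    prompts i and j share a label.
    """
    if not label_samples:
        raise ValueError("at least one label sample is required")
    n = len(label_samples[0])
    if any(len(labels) != n for labels in label_samples):
        raise ValueError("all samples must label the same number of prompts")

    counts = [[0] * n for _ in range(n)]
    for labels in label_samples:
        for i in range(n):
            for j in range(n):
                # Only co-membership matters, not the label values themselves,
                # so the PSM is unaffected by label switching between samples.
                if labels[i] == labels[j]:
                    counts[i][j] += 1

    total = len(label_samples)
    return [[c / total for c in row] for row in counts]


if __name__ == "__main__":
    # Three hypothetical posterior samples over three prompts.
    samples = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
    for row in posterior_similarity_matrix(samples):
        print(" ".join(f"{value:.2f}" for value in row))
```

Reordering rows and columns for display, as the Figure 2 caption describes, permutes the matrix without changing any of its entries.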