Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

TL;DR

This paper tackles the problem of detecting malicious LLM prompts in production by addressing the trade-offs between performance, efficiency, and adaptability. It introduces BAGEL, a modular ensemble framework that uses small, specialized promptcops, a random-forest router for dynamic routing, and stochastic aggregation to achieve high detection accuracy with a fraction of the parameters of monolithic guardrails. The approach supports incremental updates by adding new promptcops trained on fresh attack datasets, without full-system retraining, while maintaining interpretability through the router's features. Across nine diverse datasets, BAGEL achieves a peak $F1$ of $0.92$ with only about $430$M effective parameters, outperforming black-box APIs and comparable white-box baselines, and remains robust as new attacks are introduced. The work demonstrates that an ensemble of compact classifiers with intelligent routing can deliver strong, efficient, and adaptable LLM safety in production settings, offering a sustainable alternative to giant guardrails.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs an ensemble of fine-tuned models, inspired by bootstrap aggregation and mixture of experts, in which each model is specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
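
The inference flow described in the abstract (route to one member, sample additional members, aggregate their predictions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the uniform sampling of additional members, the mean aggregation, and the 0.5 threshold are all assumptions made for illustration, and the router is represented only by a list of per-member scores rather than a trained random forest. The only figures taken from the abstract are the 86M parameters per classifier and the selection of 5 members.

```python
import random
from typing import Callable, List, Sequence, Tuple

# From the abstract: each ensemble member is an 86M-parameter classifier.
PARAMS_PER_MEMBER_M = 86


def select_members(router_scores: Sequence[float], n: int,
                   rng: random.Random) -> List[int]:
    """Return n member indices: the router's top choice plus n-1 sampled others."""
    k = len(router_scores)
    if not 1 <= n <= k:
        raise ValueError("n must be between 1 and the ensemble size")
    # The member with the highest router score is always included.
    top = max(range(k), key=lambda i: router_scores[i])
    others = [i for i in range(k) if i != top]
    # Assumption: the remaining members are drawn uniformly without replacement.
    return [top] + rng.sample(others, n - 1)


def aggregate(probabilities: Sequence[float]) -> float:
    """Assumption: aggregate member outputs with a simple mean."""
    return sum(probabilities) / len(probabilities)


def classify(prompt: str,
             members: Sequence[Callable[[str], float]],
             router_scores: Sequence[float],
             n: int,
             rng: random.Random,
             threshold: float = 0.5) -> Tuple[bool, float, List[int]]:
    """Flag a prompt as malicious if the aggregated probability meets the threshold."""
    chosen = select_members(router_scores, n, rng)
    score = aggregate([members[i](prompt) for i in chosen])
    return score >= threshold, score, chosen


def effective_params_m(n: int) -> int:
    """Effective parameter count (in millions) when n members are evaluated."""
    return n * PARAMS_PER_MEMBER_M
```

Under these assumptions, `effective_params_m(5)` gives 430, which matches the 430M effective parameters reported for a selection size of 5. An incremental update would correspond to appending one more callable to `members` and giving the router one more score to produce.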

Paper Structure (24 sections, 5 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Examples of the three types of malicious prompt attacks. The malicious task has been redacted here to prevent direct inclusion of harmful content in this paper.
  • Figure 2: Overview of BAGEL, divided into two sections. The larger section details the ensemble selection strategy and probability aggregation employed during real-time classification of incoming prompts. The smaller section details the dataset partitioning, fine-tuning, and addition of a new promptcop to the ensemble during system updates.
  • Figure 3: ASR and FPR performance curves across increasing selection size ($n$) for $k=9$.
  • Figure 4: Effects of modifying the selection size ($n$) on ASR and FPR while adding datasets (modifying $k$) over time.
  • Figure 5: Spearman correlations of the random forest features and the resulting hierarchical clustering diagram, showing relative correlations between clusters of features.