Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings

Aaron Zheng, Mansi Rana, Andreas Stolcke

TL;DR

The paper tackles the need for lightweight, cost-efficient guardrails for LLM deployments by fine-tuning a Sentence-BERT embedding model to perform safe/unsafe prompt classification. It demonstrates that a 67M-parameter model can achieve performance on par with much larger guards (e.g., LlamaGuard 7B) on the AEGIS safety benchmark, with latency around $0.05$ seconds and significantly lower compute requirements. The authors explore multiple training and deployment configurations, with the McEMcC setup (multi-embedding, multi-class classification) and triplet-soft loss delivering the best results (accuracy of $88.83\%$, AUPRC $0.946$, F1 $0.89$). The work highlights the practicality of embedding-based guardrails for cost-constrained environments and discusses avenues for improvement, including multilinguality, multimodal inputs, and topic-based few-shot filtering. Overall, the approach offers a scalable, efficient alternative to heavy LLM-based guardrails while maintaining strong safety performance.

Abstract

With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proofs of concept and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantify and control an LLM's behavior, ensuring that its use is reliable, safe, accurate and also aligned with the users' expectations. Previous approaches for filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI's MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT. This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark.
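To make the idea of an embedding-based guardrail concrete, the following is a minimal sketch of how a safe/unsafe decision can be made from sentence embeddings once they exist. It is not the authors' method: the nearest-centroid rule, the function names, and the 3-dimensional toy vectors are all assumptions chosen for illustration. In practice the vectors would come from a fine-tuned Sentence-BERT encoder, and the paper's own classification setup may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per label."""
    by_label = {}
    for vec, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy embeddings standing in for encoder output.
train_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["safe", "safe", "unsafe", "unsafe"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = classify([0.2, 0.7, 0.1], centroids)
print(prediction)
```

Because the decision reduces to a handful of dot products over a fixed-size vector, the cost at inference time is dominated by the encoder forward pass, which is consistent with the low-latency motivation stated in the abstract.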

Paper Structure

This paper contains 23 sections, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Sentence transformer architecture. Left: training. Right: inference.
  • Figure 2: Precision-recall curve for our best model.