LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay

TL;DR

Model-based guardrails for LLM content moderation often rely on large guard models that are impractical on resource-constrained devices. LoRA-Guard introduces a dual-path, parameter-efficient framework that shares language understanding with the chat backbone through LoRA adapters, while a lightweight guard head outputs safe/unsafe judgments and harm-category logits; this preserves generation quality and yields 100–1000x guard-parameter savings. Trained on BeaverTails-30k and evaluated on ToxicChat and OpenAI Moderation, LoRA-Guard achieves competitive AUPRC with strong per-category signals, though generalization to out-of-distribution datasets such as ToxicChat can be limited, highlighting dataset-shift risks. The approach enables on-device content moderation in resource-constrained deployments: the adapters can be activated for guarding and deactivated for generation without retraining the main model. Overall, LoRA-Guard is a step toward scalable, on-device safety for LLM-powered applications.

Abstract

Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained portable devices, such as mobile phones, which increasingly run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
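
The dual-path idea in the abstract can be illustrated with a minimal sketch. It assumes a single linear layer with frozen chat weights W and a low-rank guard update ΔW = BA (the standard LoRA form), plus a small linear guard head. The class names, dimensions, the sigmoid on the first logit, and the choice of three harm categories are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class DualPathLayer:
    """Hypothetical linear layer: frozen chat weights W plus a low-rank guard update B @ A."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so Delta W = B @ A is zero before any guard training.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x, guard=False):
        h = matvec(self.W, x)  # generative path: W only
        if guard:
            # Guarding path: add the low-rank update, i.e. compute (W + B @ A) x.
            delta = matvec(self.B, matvec(self.A, x))
            h = [a + b for a, b in zip(h, delta)]
        return h


class GuardHead:
    """Hypothetical head: one safe/unsafe logit followed by per-category logits."""

    def __init__(self, d_model, n_categories, seed=1):
        rng = random.Random(seed)
        self.V = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                  for _ in range(1 + n_categories)]

    def __call__(self, h):
        logits = matvec(self.V, h)
        unsafe_prob = 1.0 / (1.0 + math.exp(-logits[0]))  # assumed sigmoid scoring
        return unsafe_prob, logits[1:]


layer = DualPathLayer(d_in=4, d_out=4, rank=2)
head = GuardHead(d_model=4, n_categories=3)
x = [1.0, 0.5, -0.3, 2.0]

gen_before = layer.forward(x)
# Stand-in for guard training: only the adapter matrix B changes, W stays frozen.
layer.B = [[0.5] * 2 for _ in range(4)]
gen_after = layer.forward(x)
guard_features = layer.forward(x, guard=True)
unsafe_prob, category_logits = head(guard_features)
```

In this sketch the generative output is identical before and after the adapter is modified, because the generative path never reads A or B; only the guarding path sees the update.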

Paper Structure (27 sections, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Overview of LoRA-Guard, outlined in the methodology section. The generative path uses only the chat model weights ($W$) to produce a response, while the guarding path uses both the chat weights and the guard adapters ($W$ and $\Delta W$, respectively) to produce a harmfulness score. The system can guard the user prompt, the model response, or their concatenation ($\mathbin{+\mkern-10mu+}$).
  • Figure 2: Harmful content detection on the BeaverTails-30k test set. LoRA-Guard performs on par with or better than competing guard models, at a 100-1000x reduction in guard parameters (those additional to the parameters needed to run the chat application being monitored). LoRA-Guard and LLaMA-Guard-2-8B-FT have been trained on BeaverTails-30k, while the LLaMA-Guard-3 models have not.
  • Figure 3: Prompt format for Self Defense baseline.
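
To give a feel for why low-rank guard adapters add so few parameters (the reduction cited in the Figure 2 caption), the sketch below compares a dense weight matrix with a rank-r LoRA update for one layer. The layer width of 4096 and rank of 8 are hypothetical values chosen for illustration; this is not the paper's parameter accounting, which compares against entire guard models.

```python
def dense_params(d_in, d_out):
    """Parameters in a full d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters in a LoRA update B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in = d_out = 4096
rank = 8

full = dense_params(d_in, d_out)      # 16,777,216
low = lora_params(d_in, d_out, rank)  # 65,536
ratio = full / low                    # 256.0
print(f"dense: {full:,}  lora: {low:,}  ratio: {ratio:.0f}x")
```

The ratio grows with layer width and shrinks with rank, so the overall saving depends on which layers carry adapters and on the rank chosen.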