LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay
TL;DR
Model-based guardrails for LLM content moderation typically rely on large, separate guard models that are impractical on resource-constrained devices. LoRA-Guard is a dual-path, parameter-efficient framework: LoRA adapters attached to the chat backbone reuse its language understanding, and a lightweight guard head outputs a safe/unsafe judgment together with harm-category logits. Because the adapters are used only on the guard path, generation quality is preserved, and the guard's parameter overhead is 100–1000x lower than that of existing guard models. Trained on BeaverTails-30k and evaluated against ToxicChat and OpenAI Moderation, LoRA-Guard achieves competitive AUPRC with strong per-category signals, although generalization to out-of-distribution datasets such as ToxicChat can be limited, which highlights the risk of dataset shift. The adapters can be activated for guarding and deactivated for generation without retraining the main model, which makes on-device content moderation practical for resource-constrained deployments. Overall, LoRA-Guard is a step toward scalable, on-device safety for LLM-powered applications.
Abstract
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained portable devices, such as mobile phones, which increasingly run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
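
The dual-path idea can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. A single linear layer stands in for the whole chat backbone, random weights stand in for trained ones, and every name (`LoRALinear`, `DualPathModel`, `generate_logits`, `guard_logits`, the dimensions) is a hypothetical choice made for illustration. The sketch only shows the mechanism: the frozen weights serve both paths, the low-rank update is applied on the guard path only, and the guard head returns one unsafe logit plus per-category logits.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus an optional low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, rng):
        self.W = rand_matrix(d_out, d_in, rng)   # frozen backbone weight
        self.A = rand_matrix(rank, d_in, rng)    # adapter factor (trainable)
        # Assumption: B is random here to mimic an already-trained adapter.
        self.B = rand_matrix(d_out, rank, rng)   # adapter factor (trainable)

    def forward(self, x, adapter_on):
        y = matvec(self.W, x)
        if adapter_on:
            # Low-rank correction: B @ (A @ x), added only on the guard path.
            delta = matvec(self.B, matvec(self.A, x))
            y = [a + b for a, b in zip(y, delta)]
        return y

    def adapter_param_count(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


class DualPathModel:
    """Toy stand-in: one shared layer, a frozen LM head, and a small guard head."""

    def __init__(self, d_model=8, vocab=5, n_categories=3, rank=2, seed=0):
        rng = random.Random(seed)
        self.layer = LoRALinear(d_model, d_model, rank, rng)
        self.lm_head = rand_matrix(vocab, d_model, rng)                # frozen
        self.guard_head = rand_matrix(1 + n_categories, d_model, rng)  # trainable

    def generate_logits(self, x):
        # Generative path: adapter deactivated, so the output matches the base model.
        return matvec(self.lm_head, self.layer.forward(x, adapter_on=False))

    def guard_logits(self, x):
        # Guard path: adapter activated, features fed to the guard head.
        out = matvec(self.guard_head, self.layer.forward(x, adapter_on=True))
        return out[0], out[1:]  # (unsafe logit, harm-category logits)


if __name__ == "__main__":
    model = DualPathModel()
    x = [0.5, -1.0, 0.25, 0.0, 1.0, -0.5, 0.75, 0.1]
    print("generation logits:", model.generate_logits(x))
    print("guard output:", model.guard_logits(x))
    print("adapter parameters:", model.layer.adapter_param_count())
```

In this toy setup, changing the adapter factors `A` and `B` leaves `generate_logits` untouched, which is the property the dual-path design relies on. The adapter adds `rank * (d_in + d_out)` parameters per adapted layer, which stays small when the rank is much lower than the layer width.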
