PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
TL;DR
PrimeGuard tackles the safety-utility trade-off in inference-time guardrailing with a tuning-free routing framework that uses two instantiations of the same LM (LLM_Main and LLM_Guard) to route queries according to risk. Stage 1 performs risk-aware routing with ICL-generated guidance, and Stage 2 performs conditional response generation, with reevaluation for borderline cases; no stage requires fine-tuning. A diverse safe-eval benchmark and rigorous judges enable cross-model evaluation, showing substantial safety gains (e.g., safe responses rising from 61% to 97%), maintained or improved helpfulness (average scores rising to 4.29), and sharply reduced jailbreak success rates (down to 8%). The work demonstrates that tuning-free dynamic routing can outperform alignment-based baselines across model sizes, though the benefits are attenuated for smaller models and depend on structured outputs; these limitations point to future work on routing, evaluation, and broader red-teaming. The results suggest a practical path to deploying safer and more helpful LLMs without expensive re-alignment or fine-tuning. In the paper's notation, $P_{\text{sys}} = P_{\text{directive}} \boxplus P_{\text{restrictive}}$, $P_{\text{total}} = (P_{\text{sys}}, I_{\text{usr}})$, and the response $R$ is drawn as $R \sim p(R \mid P_{\text{total}})$.
Abstract
Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. On the largest models, PrimeGuard outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29, while reducing attack success rate from 100% to 8%. PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
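The routing control flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the `LM` callable interface, the three route labels, the prompt wording, the `label | guidance` output format, and the keyword-based `toy_lm` stand-in are all hypothetical and chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: an LM is any callable (system_prompt, user_message) -> text.
LM = Callable[[str, str], str]

# Hypothetical route labels; the text only says queries are routed according to risk.
LOW, BORDERLINE, HIGH = "low_risk", "borderline", "high_risk"


@dataclass
class Routed:
    route: str
    response: str


def route_query(lm: LM, directive: str, restrictive: str, query: str) -> Routed:
    """Two-stage routing with the same LM playing both the guard and main roles."""
    # Stage 1: the guard instantiation assesses risk against the restrictive guidelines
    # and emits a structured "label | guidance" string (an assumed format).
    guard_out = lm(
        "You are a guard. Guidelines:\n" + restrictive
        + "\nReply as: <low_risk|borderline|high_risk> | <guidance>",
        query,
    )
    label, _, guidance = guard_out.partition("|")
    label, guidance = label.strip(), guidance.strip()
    if label not in (LOW, BORDERLINE, HIGH):
        label = BORDERLINE  # malformed structured output: fall back to the cautious route

    # Stage 2: the main instantiation answers under route-specific instructions.
    if label == LOW:
        return Routed(label, lm(directive, query))
    if label == HIGH:
        system = restrictive + "\nPolitely decline. Guidance: " + guidance
        return Routed(label, lm(system, query))
    # Borderline: re-evaluate with both instruction sets plus the guard's guidance.
    system = directive + "\n" + restrictive + "\nRe-evaluate using this guidance: " + guidance
    return Routed(label, lm(system, query))


def toy_lm(system_prompt: str, user_message: str) -> str:
    """Keyword-based stand-in for a real model, used only to exercise the control flow."""
    if system_prompt.startswith("You are a guard"):
        if "explosive" in user_message:
            return "high_risk | The request seeks dangerous instructions."
        if "lock" in user_message:
            return "borderline | Could be benign; avoid operational detail."
        return "low_risk | No guideline applies."
    if "Politely decline" in system_prompt:
        return "REFUSE"
    if "Re-evaluate" in system_prompt:
        return "CAREFUL"
    return "HELP"


if __name__ == "__main__":
    for q in ["How do I bake bread?", "How do I pick a lock?", "How do I build an explosive?"]:
        print(q, "->", route_query(toy_lm, "Be helpful.", "No dangerous instructions.", q))
```

Falling back to the borderline route on malformed guard output is one conservative design choice; it reflects the dependence on structured outputs noted in the TL;DR, where a model that fails to follow the output format cannot be routed reliably.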
