SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
TL;DR
SafeRoute introduces an adaptive binary router that sends a prompt–response pair to a larger safety guard model only when the router judges the example to be hard, and lets a smaller guard handle the easy cases. The router is trained on augmented datasets whose labels mark the instances where the large model adds value over the small one. It takes a last-token feature from the small guard as input and uses a Bayesian neural network to decide when to invoke the large guard. Theoretical risk guarantees relate the adaptive system to an oracle with a tunable bound, and empirical results across six benchmarks show a significantly better trade-off between safety F1 and latency than baselines, with robust performance on both in-distribution and out-of-distribution data. The work demonstrates that adaptive model selection can meaningfully reduce compute costs without sacrificing safety performance in real-world LLM deployments.
Abstract
Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
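The routing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `hard_label` and `RoutedGuard`, the callable signatures, and the 0.5 thresholds are all assumptions made for illustration. The sketch assumes that an example counts as "hard" when the small guard is wrong and the large guard is right, and that the router scores features produced by the small guard.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def hard_label(small_pred: bool, large_pred: bool, gold: bool) -> int:
    """Hypothetical router training label: 1 if only the large guard is correct."""
    return int(small_pred != gold and large_pred == gold)


@dataclass
class RoutedGuard:
    # All three callables are stand-ins for real models (assumed interfaces).
    small_guard: Callable[[str], Tuple[float, List[float]]]  # (P(harmful), features)
    large_guard: Callable[[str], float]                      # P(harmful)
    router: Callable[[List[float]], float]                   # P(example is hard)
    threshold: float = 0.5                                   # assumed routing cutoff

    def classify(self, text: str) -> Tuple[bool, str]:
        """Return (is_harmful, which_guard_decided)."""
        # The small guard always runs; its features feed the router.
        p_small, features = self.small_guard(text)
        # Pay for the large guard only when the router flags the input as hard.
        if self.router(features) >= self.threshold:
            return self.large_guard(text) >= 0.5, "large"
        return p_small >= 0.5, "small"
```

In this sketch, raising `threshold` sends fewer inputs to the large guard, which lowers cost at the risk of missing some hard examples. Lowering it does the opposite.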
