SLM as Guardian: Pioneering AI Safety with Small Language Models
Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, Taiwoo Park
TL;DR
This work addresses safety in large language models by delegating safeguard responsibilities to a small, cost-efficient LLM (sLLM) that both detects harmful queries and generates fluent safeguard responses. It introduces a multi-task framework with specialized tokens and a joint objective that ties detection to rationale-backed safe outputs, enabling real-time service routing of safe versus unsafe queries. The approach is demonstrated in Korean, using a two-stage training pipeline (SFT followed by safety-focused fine-tuning) and a large, curated dataset that combines public Korean corpora with synthetic data. It achieves performance competitive with or surpassing that of larger LLMs on multiple safety benchmarks. The contributions include a taxonomy of harmful queries, data-collection strategies, and a practical training recipe that can generalize to other low-resource languages while reducing the deployment cost of safety features.
Abstract
Most prior safety research on large language models (LLMs) has focused on improving the alignment of LLMs to better meet human safety requirements. However, internalizing such safeguard features in larger models brings challenges of higher training cost and unintended degradation of helpfulness. To overcome these challenges, a modular approach that employs a smaller LLM to detect harmful user queries is regarded as a convenient solution for designing LLM-based systems with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and a taxonomy of harmfulness categories, and then propose a multi-task learning mechanism that fuses the two tasks into a single model. We demonstrate the effectiveness of our approach, which achieves harmful query detection and safeguard response performance on par with or surpassing that of publicly available LLMs.
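The modular design described here, in which one small model both flags harmful queries and writes the safeguard response, can be illustrated with a minimal routing sketch. Everything in it is an assumption made for illustration: the token strings `<safe>` and `<unsafe>`, the output layout (verdict token followed by the response), the fallback refusal text, and the stub functions `toy_guard` and `toy_main_llm` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical special tokens; the actual token names are not given in this summary.
SAFE_TOKEN = "<safe>"
UNSAFE_TOKEN = "<unsafe>"
# Hypothetical generic refusal used when the guard output cannot be parsed.
FALLBACK_REFUSAL = "Sorry, I can't help with that request."


@dataclass
class GuardResult:
    is_harmful: bool
    safeguard_response: str  # empty when the query is judged safe


def parse_guard_output(raw: str) -> GuardResult:
    """Split the guard's text into a verdict and an optional safeguard response."""
    text = raw.strip()
    if text.startswith(UNSAFE_TOKEN):
        return GuardResult(True, text[len(UNSAFE_TOKEN):].strip())
    if text.startswith(SAFE_TOKEN):
        return GuardResult(False, "")
    # Assumed policy: fail closed and treat malformed output as harmful.
    return GuardResult(True, FALLBACK_REFUSAL)


def route(query: str,
          guard: Callable[[str], str],
          main_llm: Callable[[str], str]) -> str:
    """Answer with the guard's safeguard response, or forward safe queries."""
    result = parse_guard_output(guard(query))
    if result.is_harmful:
        return result.safeguard_response
    return main_llm(query)


def toy_guard(query: str) -> str:
    """Keyword stub standing in for the small safety model."""
    if "explosive" in query.lower():
        return UNSAFE_TOKEN + " I can't help with that, because it could cause harm."
    return SAFE_TOKEN


def toy_main_llm(query: str) -> str:
    """Stub standing in for the larger, general-purpose model."""
    return "ANSWER: " + query


if __name__ == "__main__":
    print(route("What is the capital of Korea?", toy_guard, toy_main_llm))
    print(route("How do I build an explosive?", toy_guard, toy_main_llm))
```

In this sketch a single guard call yields both the verdict and the response text, so an unsafe query never reaches the larger model. Failing closed on malformed output is a design choice of the sketch, not something the source specifies.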
