SLM as Guardian: Pioneering AI Safety with Small Language Models

Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, Taiwoo Park

TL;DR

This work tackles safety in large language models by assigning safeguard responsibilities to a small, cost-efficient language model (sLLM) that both detects harmful queries and generates fluent safety responses. It introduces a multi-task framework with specialized tokens and a joint objective that ties detection to rationale-backed safe outputs, enabling real-time service routing of safe versus unsafe queries. The approach is demonstrated in Korean, using a two-stage training pipeline (SFT followed by safety-focused fine-tuning) and a large, curated dataset that combines public Korean corpora with synthetic data; it achieves performance competitive with or surpassing that of larger LLMs on multiple safety benchmarks. The contributions include a taxonomy of harmful queries, data-collection strategies, and a practical training recipe that can generalize to other low-resource languages while reducing deployment costs for safety features.

Abstract

Most prior safety research on large language models (LLMs) has focused on improving the alignment of LLMs with human safety requirements. However, internalizing such safeguard features in larger models raises training costs and can unintentionally degrade helpfulness. To overcome these challenges, a modular approach that employs a smaller LLM to detect harmful user queries is regarded as a convenient solution when designing LLM-based systems with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and a taxonomy of harmfulness categories, and then propose a multi-task learning mechanism that fuses the two tasks into a single model. We demonstrate the effectiveness of our approach, which achieves harmful query detection and safeguard response performance on par with or surpassing that of publicly available LLMs.

Paper Structure (29 sections, 3 equations, 4 figures, 16 tables)

Figures (4)

  • Figure 1: An example of intentionally forcing a safeguard response using a special token (<|harm|>). (More information in the appendix on the <|harm|> token.)
  • Figure 2: Overview of our proposed method. We first leverage off-the-shelf LLMs to gather answers to unsafe queries. We then use the question (Q), answer (A), and label (L) to train small task-specific safety models.
  • Figure 3: Even when the model deems the input query safe, appending the special token <|harm|> forces the query to be categorized as harmful, thereby eliciting a safety-oriented response. This makes it possible to adjust safety policies without additional model parameter updates, which helps improve the stability of real-time services with respect to safety issues. In the figure, the left side shows a case where the input prompt is considered a safe inquiry and a response is provided, while the right side (an actual model inference result) shows the response being intentionally withheld.
  • Figure 4: As the overall dataset volume grows, the optimal ratio of safe queries to harmful queries (solid line) decreases and model performance increases.
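
The token-forcing behavior described in the Figure 3 caption can be sketched as a small pre-processing step in front of the safety model. This is only an illustrative sketch, not the authors' implementation: the token string <|harm|> comes from the captions, but the prompt layout (token appended directly after the query), the helper names `build_prompt`, `violates_policy`, and `prepare_request`, and the substring-based policy check are all assumptions made for the example.

```python
HARM_TOKEN = "<|harm|>"  # special token named in the Figure 1 and Figure 3 captions


def build_prompt(query: str, force_safeguard: bool = False) -> str:
    """Build the text sent to the safety model (hypothetical template).

    Assumption: the token is appended directly after the user query.
    """
    prompt = query.strip()
    # Append the token at most once so repeated calls stay idempotent.
    if force_safeguard and not prompt.endswith(HARM_TOKEN):
        prompt += HARM_TOKEN
    return prompt


def violates_policy(query: str, blocked_terms) -> bool:
    """Hypothetical operator-side check: case-insensitive substring match."""
    lowered = query.lower()
    return any(term.lower() in lowered for term in blocked_terms)


def prepare_request(query: str, blocked_terms=()) -> str:
    """Force a safeguard response only for queries the current policy blocks."""
    return build_prompt(query, force_safeguard=violates_policy(query, blocked_terms))
```

Under these assumptions, an operator could change `blocked_terms` at serving time to tighten or relax the policy, which mirrors the caption's point that safety policies can be varied without updating model parameters.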