SGuard-v1: Safety Guardrail for Large Language Models

JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee, Juree Seok

TL;DR

SGuard-v1 tackles the safety challenges of real-time human-LLM interactions by coupling two specialized guardrails: ContentFilter, a bilingual (English/Korean) multi-class classifier, and JailbreakFilter, a targeted adversarial-prompt detector. Built on a lightweight 2B-parameter base, it leverages a rigorous data pipeline (seed data, Contextual Harm Translation, and Benign-Harmful Contextual Blending) and curriculum-based training to achieve strong safety performance with low memory overhead. Key contributions include a high-quality, bilingual safety dataset, a two-model architecture, and state-of-the-art results on public and proprietary benchmarks, all released under Apache-2.0 to facilitate research and deployment. The approach enables safer, scalable deployment of LLM services across platforms and languages, while maintaining interpretability through per-sample class labels and confidence scores.

Abstract

We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two components according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.
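
The screening flow described in the abstract (prompts checked by both filters, responses checked by ContentFilter only) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Verdict` type, the `guarded_chat` function, the 0.5 threshold, and the keyword-based toy filters are all hypothetical stand-ins for the real models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical filter output: a binary decision plus a confidence score."""
    unsafe: bool
    confidence: float               # confidence that the text is unsafe, in [0, 1]
    category: Optional[str] = None  # multi-class label, if the filter provides one


Filter = Callable[[str], Verdict]


def guarded_chat(prompt: str, llm: Callable[[str], str],
                 content_filter: Filter, jailbreak_filter: Filter,
                 threshold: float = 0.5) -> str:
    """Screen the prompt with both filters, then screen the LLM response."""
    for name, check in (("ContentFilter", content_filter),
                        ("JailbreakFilter", jailbreak_filter)):
        verdict = check(prompt)
        if verdict.unsafe and verdict.confidence >= threshold:
            return f"[blocked prompt: {name}]"
    response = llm(prompt)
    verdict = content_filter(response)
    if verdict.unsafe and verdict.confidence >= threshold:
        return "[blocked response: ContentFilter]"
    return response


# Toy stand-ins (assumptions for illustration only, not the released models).
def toy_content_filter(text: str) -> Verdict:
    hit = "weapon" in text.lower()
    return Verdict(hit, 0.9 if hit else 0.1, "hypothetical_category" if hit else None)


def toy_jailbreak_filter(text: str) -> Verdict:
    hit = "ignore previous instructions" in text.lower()
    return Verdict(hit, 0.8 if hit else 0.05)


def toy_llm(prompt: str) -> str:
    return "echo: " + prompt
```

Keeping the two checks behind a shared `Filter` signature means either stub could be swapped for a real model call without changing the control flow.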

Paper Structure

This paper contains 25 sections, 4 figures, 8 tables, and 2 algorithms.

Figures (4)

  • Figure 1: (Best viewed in color) The schematic illustration of SGuard-v1: Harmful and adversarial prompts are screened by ContentFilter and JailbreakFilter while unsafe responses generated by LLMs are filtered by ContentFilter. $^*$The LLM image is generated by GPT-5.
  • Figure 2: The effect of Contextual Harm Translation.
  • Figure 3: Newly generated unsafe examples by BHCB.
  • Figure 4: (Best viewed in color) The effect of our curriculum learning. Through first-phase and second-phase training, the model gradually improves its FNR-FPR curve on the jailbreak detection benchmarks.
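
An FNR-FPR curve of the kind referred to in Figure 4 can be obtained by sweeping a decision threshold over a detector's unsafe scores. The sketch below is a generic illustration under assumed conventions (label 1 = jailbreak, label 0 = benign, a prompt is flagged when its score is at or above the threshold); the function names are hypothetical and the paper's own evaluation code may differ.

```python
from typing import List, Sequence, Tuple


def fnr_fpr(labels: Sequence[int], scores: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """Return (false-negative rate, false-positive rate) at one threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # A jailbreak scored below the threshold is missed (false negative).
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    # A benign prompt scored at or above the threshold is wrongly flagged.
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return fn / positives, fp / negatives


def fnr_fpr_curve(labels: Sequence[int], scores: Sequence[float],
                  thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FNR, FPR) triples for each threshold in the sweep."""
    return [(t, *fnr_fpr(labels, scores, t)) for t in thresholds]
```

Raising the threshold trades false positives for false negatives, so a detector whose curve sits closer to the origin is better on both error types at once.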