BingoGuard: LLM Content Moderation Tools with Risk Levels

Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu

TL;DR

The paper tackles the insufficiency of binary LLM content moderation by introducing a per-topic severity taxonomy and a generate-then-filter data synthesis framework to train a multi-task moderator that outputs both safety labels and severity levels. It presents BingoGuard, an 8B-class model trained via supervised fine-tuning on BingoGuardTrain, achieving state-of-the-art results on BingoGuardTest and public benchmarks, and reveals that explicit severity supervision improves both accuracy and risk calibration. The work provides BingoGuardTrain and BingoGuardTest datasets, demonstrates the value of severity-aware evaluation, and shows that severity information helps avoid over-confident but misaligned safety judgments. The proposed framework enables more nuanced content filtering aligned with varying platform safety thresholds and offers a foundation for future severity-aware moderation at scale, including public release plans and explicit ethics considerations.

Abstract

Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of severity-level annotations, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity levels, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled according to our severity rubrics, which enables fine-grained analysis of model behavior at different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
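To make the point about platform-specific safety thresholds concrete, here is a minimal sketch of how a downstream platform might act on a moderator that returns both a binary label and a severity level. Everything in it is an assumption made for illustration: the `ModerationResult` type, the `should_block` helper, the integer severity scale (0 meaning safe, larger meaning more severe), and the two example thresholds are hypothetical and are not taken from the paper or from any released BingoGuard interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical container for a moderator's two outputs."""
    unsafe: bool   # binary safety label
    severity: int  # assumed scale: 0 = safe, larger = more severe


def should_block(result: ModerationResult, max_allowed_severity: int) -> bool:
    """Block a response only if it is unsafe and exceeds the platform's tolerance."""
    if not result.unsafe:
        return False
    return result.severity > max_allowed_severity


# Two hypothetical platforms with different tolerances.
STRICT_THRESHOLD = 0   # blocks every unsafe response
LENIENT_THRESHOLD = 2  # tolerates unsafe responses up to severity 2

mild = ModerationResult(unsafe=True, severity=1)
severe = ModerationResult(unsafe=True, severity=4)
benign = ModerationResult(unsafe=False, severity=0)

decisions = {
    "mild/strict": should_block(mild, STRICT_THRESHOLD),
    "mild/lenient": should_block(mild, LENIENT_THRESHOLD),
    "severe/lenient": should_block(severe, LENIENT_THRESHOLD),
    "benign/strict": should_block(benign, STRICT_THRESHOLD),
}
```

With a binary-only moderator the two platforms would have to make the same decision on the mild response; the severity level is what lets them diverge.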

Paper Structure

This paper contains 27 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Overall contributions of our work. We start by defining taxonomy with severity rubrics (left). Then, we implement a data synthesis framework that produces data that matches the severity taxonomy (middle). Finally, we leverage the datasets to train moderation models that outperform prior work on both internal and external evaluation benchmarks related to content moderation (right).
  • Figure 2: An illustration of the taxonomy. We show the 11 topics, 7 dimensions, and the 5 risks of harm in the upper part and give a concrete rubric example in the lower part. Underlines and colors highlight how the dimensions shape the final concrete rubrics.
  • Figure 3: The framework for generating harmful responses of different levels. (Top) the three steps for fine-tuning specialized LLM generators to obtain responses of different levels. (Bottom) the refinement process illustrated on a concrete example. The arrows show the order of the procedure.
  • Figure 4: Averaged predictive probability of the 'unsafe' token for unsafe examples of different levels. The x-axis shows the levels. The y-axis shows the predictive probability. The predictive probabilities of LlamaGuard3 and MD-Judge are only weakly correlated with severity.
  • Figure 5: Ablation study on models, sizes, iterations, and excluding severity classification.
  • ...and 1 more figure
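
The quantity described in the Figure 4 caption, the averaged predictive probability of the 'unsafe' token per severity level, can be sketched as a simple grouped mean. This is only an illustrative sketch: the function name `mean_unsafe_prob_by_level` is hypothetical, and the records below are made-up placeholder values chosen for easy arithmetic, not results from the paper.

```python
from collections import defaultdict


def mean_unsafe_prob_by_level(records):
    """Average the 'unsafe' probability over examples sharing a severity level.

    `records` is an iterable of (severity_level, unsafe_probability) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, prob in records:
        sums[level] += prob
        counts[level] += 1
    # Return levels in ascending order so the result maps onto the x-axis.
    return {level: sums[level] / counts[level] for level in sorted(sums)}


# Made-up placeholder data: (severity level, probability assigned to 'unsafe').
records = [(1, 0.25), (1, 0.75), (2, 0.5), (3, 0.5), (3, 1.0)]
averages = mean_unsafe_prob_by_level(records)
```

If a moderator's confidence tracked severity well, these per-level averages would rise steadily with the level; a flat or erratic profile corresponds to the weak correlation the caption reports.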