BingoGuard: LLM Content Moderation Tools with Risk Levels

Fan Yin; Philippe Laban; Xiangyu Peng; Yilun Zhou; Yixin Mao; Vaibhav Vats; Linnea Ross; Divyansh Agarwal; Caiming Xiong; Chien-Sheng Wu

TL;DR

The paper addresses the limits of binary LLM content moderation by introducing a per-topic severity taxonomy and a generate-then-filter data synthesis framework, which together are used to train a multi-task moderator that outputs both safety labels and severity levels. The resulting model, BingoGuard-8B, is trained via supervised fine-tuning on BingoGuardTrain and achieves state-of-the-art results on BingoGuardTest and public benchmarks; the analysis shows that explicit severity supervision improves both detection accuracy and risk calibration. The work contributes the BingoGuardTrain and BingoGuardTest datasets, demonstrates the value of severity-aware evaluation, and shows that severity information helps avoid over-confident but misaligned safety judgments. The framework enables more nuanced content filtering aligned with varying platform safety thresholds and offers a foundation for severity-aware moderation at scale. The paper also discusses public release plans and ethics considerations.

Abstract

Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity levels, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis of model behavior at different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
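
To illustrate how a platform might use a moderator that returns both a binary safety label and a severity level, here is a minimal Python sketch. It is not the paper's implementation: the `ModerationResult` type, the `should_block` helper, and the integer severity scale (0 for benign, higher for more severe) are hypothetical names and assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical moderator output: a binary label plus a severity level."""
    unsafe: bool
    severity: int  # assumed scale: 0 = benign, larger values = more severe


def should_block(result: ModerationResult, max_allowed_severity: int) -> bool:
    """Block a response only if it is flagged unsafe AND its severity
    exceeds the tolerance chosen by the platform."""
    if not result.unsafe:
        return False
    return result.severity > max_allowed_severity


# Two hypothetical platforms with different safety thresholds.
STRICT_PLATFORM = 0   # tolerates no unsafe content at all
LENIENT_PLATFORM = 2  # tolerates lower-severity content

borderline = ModerationResult(unsafe=True, severity=1)
strict_decision = should_block(borderline, STRICT_PLATFORM)
lenient_decision = should_block(borderline, LENIENT_PLATFORM)
```

Under these assumptions, the same borderline response is blocked on the strict platform and allowed on the lenient one, which is the kind of tailored filtering that a binary-only label cannot express.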
