On Calibration of LLM-based Guard Models for Reliable Content Moderation

Hongfu Liu, Hengguan Huang, Xiangming Gu, Hao Wang, Ye Wang

TL;DR

The paper examines the reliability of LLM-based guard models for content moderation by evaluating confidence calibration across 9 models and 12 benchmarks, covering both prompt and response classification. It finds that the models are widely miscalibrated and overconfident, that calibration degrades further under jailbreak attacks and when responses come from different response models, and that post-hoc methods such as contextual calibration and temperature scaling help only in part. The study also offers practical guidance: contextual calibration works better for prompt classification, temperature scaling works better for response classification, and calibration should be reported as part of reliability evaluation when guard models are released. Together, these findings argue for calibration-aware design and evaluation so that guard models can be deployed more safely in real-world settings.

Abstract

Large language models (LLMs) pose significant risks due to the potential for generating harmful content or users attempting to evade guardrails. Existing studies have developed LLM-based guard models designed to moderate the input and output of threat LLMs, ensuring adherence to safety policies by blocking content that violates these protocols upon deployment. However, limited attention has been given to the reliability and calibration of such guard models. In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets. Our analysis and experiments underscore the limitations of current LLM-based guard models and provide valuable insights for the future development of well-calibrated guard models toward more reliable content moderation. We also advocate for incorporating reliability evaluation of confidence calibration when releasing future LLM-based guard models.
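As a rough illustration of the temperature scaling mentioned in the abstract, the sketch below fits a single temperature on a held-out validation set and rescales a guard model's safe/unsafe logits with it. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`softmax`, `mean_nll`, `fit_temperature`), the two-logit layout (index 0 = safe, index 1 = unsafe), and the grid search over candidate temperatures are all hypothetical choices made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes validation NLL.

    Assumption: a simple grid search stands in for the gradient-based
    optimizer that is often used for this one-parameter fit.
    """
    if candidates is None:
        candidates = [0.25 + 0.05 * i for i in range(96)]  # 0.25 .. 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


if __name__ == "__main__":
    # Hypothetical validation data: the model always strongly predicts "unsafe"
    # (index 1), but only three of the four examples are actually unsafe.
    val_logits = [[0.0, 4.0]] * 4
    val_labels = [1, 1, 1, 0]

    t = fit_temperature(val_logits, val_labels)
    print("fitted temperature:", t)
    print("confidence before:", softmax([0.0, 4.0])[1])
    print("confidence after: ", softmax([0.0, 4.0], t)[1])
```

Because one positive temperature rescales every logit in the same way, the predicted label never changes; only the confidence moves. That is also why the method needs a labeled validation set, which is the setting in which the abstract contrasts it with contextual calibration.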

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: An overview of LLM-based guard models for content moderation. Guard models monitor the input and output during conversations between the user and the LLM (agent), providing a binary prediction followed by a specific unsafe content category if unsafe content is detected. Instruction examples for prompt classification and response classification from Llama-Guard are shown in the yellow boxes on the right.
  • Figure 2: Confidence distributions (first row) and reliability diagrams (second row) of Llama-Guard, Llama-Guard3, Aegis-Guard-P, and WildGuard on the WildGuardMix Test Prompt set.
  • Figure 3: F1 (%) $\uparrow$ and ECE (%) $\downarrow$ performance of prompt and response classification on the Harmbench-adv set.
  • Figure 4: Confidence distributions (first row) and reliability diagrams (second row) on the WildGuardMix Test Prompt set.
  • Figure 7: Confidence distributions (first row) and reliability diagrams (second row) on the Harmbench Prompt set.
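
The ECE values in Figure 3 and the reliability diagrams in the other figures both rest on the same binning idea: group predictions by confidence, then compare each bin's average confidence with its accuracy. The sketch below shows one common way to compute expected calibration error. It is an illustrative sketch, not the paper's evaluation code: the function name, the choice of 10 equal-width bins, and the half-open binning convention are assumptions made here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: probability assigned to the predicted label, in [0, 1].
                 For a binary safe/unsafe classifier this is max(p, 1 - p).
    correct:     booleans, True where the predicted label matched the truth.
    n_bins:      number of equal-width bins (10 is an assumption, not a
                 value taken from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assumption: bins are [k/n, (k+1)/n), with confidence 1.0 placed in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical overconfident classifier: 95% confidence, 50% accuracy.
    print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

An ECE of 0 means confidence matches accuracy in every bin. An overconfident model, the failure mode the abstract reports, has bins whose mean confidence exceeds their accuracy.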