Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

Cristina Pinneri, Christos Louizos

TL;DR

This paper exposes a vulnerability in guard models where shallow paraphrase variations cause unstable safety judgments, undermining semantic grounding. It introduces a self-supervised framework that uses meaning-preserving paraphrase sets to quantify semantic fragility and enforce consistency via a novel skew-aware target aggregation during training. Across six open-source guard-model families, the approach substantially reduces paraphrase-induced score variability and label flips (about a 58% reduction) while maintaining or improving benchmark accuracy (≈+2.5%), and it generalizes to unseen stylistic variations; calibration improves by up to 40%. A key finding is a bidirectional link between semantic consistency and calibration, suggesting that robustness training and calibration techniques can be combined for superior guard-model reliability in real-world safety pipelines.

Abstract

Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.
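To make the quantities in the abstract concrete, the following minimal sketch shows one way to measure score variability and label flips over a single paraphrase set, and how mean and median targets behave on a skewed set. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 0.5 decision threshold, the use of standard deviation as the variability measure, and the example scores are all hypothetical. The paper's skew-aware aggregation is not specified in this summary and is not reproduced here.

```python
from statistics import mean, median, pstdev


def paraphrase_variability(scores):
    """Population standard deviation of guard scores over one paraphrase set.

    Assumption: standard deviation is used as the variability measure;
    the paper may define variability differently.
    """
    return pstdev(scores)


def label_flip_rate(scores, threshold=0.5):
    """Fraction of paraphrases whose thresholded label differs from the
    majority label of the set (ties count as 'unsafe').

    Assumption: scores are unsafe-probabilities and 0.5 is the cutoff.
    """
    labels = [s >= threshold for s in scores]
    majority = sum(labels) * 2 >= len(labels)
    return sum(label != majority for label in labels) / len(labels)


# Hypothetical unsafe-probability scores for five paraphrases of one response.
# Four paraphrases are flagged; one outlier slips under the threshold.
scores = [0.9, 0.85, 0.95, 0.2, 0.88]

variability = paraphrase_variability(scores)
flip_rate = label_flip_rate(scores)

# Candidate set-level training targets. On a skewed set like this one,
# the mean is pulled toward the outlier while the median ignores it.
mean_target = mean(scores)
median_target = median(scores)

print(f"variability={variability:.3f} flip_rate={flip_rate:.2f}")
print(f"mean_target={mean_target:.3f} median_target={median_target:.3f}")
```

A perfectly consistent guard model would give a variability of 0 and a flip rate of 0 on every paraphrase set. The gap between the mean and median targets here illustrates why the choice of aggregation matters when the score distribution is skewed.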

Paper Structure

This paper contains 57 sections, 5 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: Our framework for improving guard model robustness. First, we generate and filter paraphrases of an LLM's response to create a semantically equivalent set. This set is used for both evaluation (by measuring score variability) and training (by enforcing prediction consistency using a robust, set-level target).
  • Figure 2: Mean, median, and skew-aware targets for different score distributions.
  • Figure 3: Comparison of score variability across refusal-style (top row) and agreement-style (bottom row) paraphrases for the large guard models.
  • Figure 4: Sensitivity of large guard models to paraphrasing before (top row) and after (bottom row) our robustness training. The tighter clustering of scores in the bottom row demonstrates a significant and consistent reduction in sensitivity across all models.
  • Figure 5: Sensitivity of small guard models to paraphrasing before (top row) and after (bottom row) our robustness training.
  • ...and 5 more figures