SGM: A Framework for Building Specification-Guided Moderation Filters

Masoomali Fatehkia, Enes Altinisik, Mohamed Osman, Husrev Taha Sencar

TL;DR

SGM is a scalable framework for building specification-guided moderation filters by automatically generating training data from user-defined alignment specifications. It employs a seven-task pipeline (T1–T7) to transform specifications into prompts, responses, rubrics, and a regression scorer, enabling multi-attribute, multilingual moderation without large-scale human annotation. Empirical results show that SGM filters are competitive with state-of-the-art safety systems on public benchmarks and align well with four custom specifications (age-appropriateness, vegan dietary considerations, cultural alignment, and medical guidance constraints), with notable cross-lingual generalization. The work demonstrates that specification-guided training can outperform inference-time prompting in multi-attribute settings and suggests practical pathways for deploying domain-specific safeguards in real-world LLM systems, while acknowledging current limitations in dialog-context handling and prompt adherence.
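As a rough illustration of how judge-panel scores could feed a multi-attribute regression scorer, the sketch below collapses per-judge scores into one training target per specification. The `ScoredResponse` layout, the `regression_targets` helper, the score scale, and the choice of a plain mean are all assumptions made for illustration; the summary above does not state how SGM aggregates judge scores.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class ScoredResponse:
    """One prompt/response pair with scores from a panel of judge LLMs."""
    prompt: str
    response: str
    # Hypothetical layout: specification name -> one score per judge.
    judge_scores: Dict[str, List[float]]


def regression_targets(item: ScoredResponse) -> Dict[str, float]:
    """Collapse each specification's judge scores into a single target.

    Averaging is an assumption for this sketch, not the paper's stated rule.
    """
    return {spec: mean(scores) for spec, scores in item.judge_scores.items()}


example = ScoredResponse(
    prompt="Suggest a quick dinner.",
    response="Try a lentil curry with rice.",
    judge_scores={
        "vegan": [5.0, 4.0, 5.0],
        "age_appropriate": [5.0, 5.0, 5.0],
    },
)
targets = regression_targets(example)
```

Each key in `targets` would correspond to one output of a multi-attribute filter, so a single model could be trained against all specifications at once.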

Abstract

Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.

Paper Structure

This paper contains 52 sections, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Architecture of a specification-guided moderation filter. Beyond universal norms like harmlessness, LLMs must increasingly align with institutional, organizational, and user-level constraints. The SGM framework transforms these specifications into training data and uses them to build targeted moderation filters.
  • Figure 2: The data generation workflow for the SGM filter consists of seven main tasks: T1 – Listing specifications; T2 – System-prompt generation; T3 – Test prompt generation; T4 – Prompt validation and rewriting; T5 – Response generation; T6 – Response scoring by a panel of judge LLMs; T7 – Filter training and evaluation. Different LLMs were used for each task based on their capabilities, as detailed in Table \ref{tab:models} (see Appendix \ref{sec:app_models_list}).
  • Figure 3: Comparing the performance of different base models disaggregated across each specification. Both models were trained as multi-attribute models.
  • Figure 4: The effect of training language on test set performance on the English test set across different specifications. The models shown here are single-attribute models trained for each individual specification with RoBERTa as a base model. The colors of the bars represent the language(s) the model was trained on.
  • Figure 5: The effect of training language on test set performance on the Arabic test set across different specifications. The models shown here are single-attribute models trained for each individual specification with RoBERTa as a base model. The colors of the bars represent the language(s) the model was trained on.
  • ...and 1 more figure