SGM: A Framework for Building Specification-Guided Moderation Filters
Masoomali Fatehkia, Enes Altinisik, Mohamed Osman, Husrev Taha Sencar
TL;DR
SGM is a scalable framework for building specification-guided moderation filters by automatically generating training data from user-defined alignment specifications. A seven-task pipeline (T1–T7) transforms each specification into prompts, responses, rubrics, and a regression scorer, enabling multi-attribute, multilingual moderation without large-scale human annotation. Empirically, SGM filters are competitive with state-of-the-art safety systems on public benchmarks and align strongly with four custom specifications (age-appropriateness, vegan dietary considerations, cultural alignment, and medical guidance constraints), with notable cross-lingual generalization. The work shows that specification-guided training can outperform inference-time prompting in multi-attribute settings and suggests practical pathways for deploying domain-specific safeguards in real-world LLM systems, while acknowledging current limitations in dialog-context handling and prompt adherence.
Abstract
Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
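
To make the multi-attribute setting concrete, the following is a minimal sketch of how per-attribute scores from a regression scorer, such as the one mentioned in the TL;DR, might be turned into an allow/block decision. It is an illustration under stated assumptions, not the authors' implementation: the names `AttributeRule` and `moderate`, the attribute labels, the score scale (higher means a stronger violation), and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeRule:
    """One attribute of a specification (hypothetical structure)."""
    name: str
    threshold: float  # assumed: a score at or above this counts as a violation


def moderate(scores: dict[str, float], rules: list[AttributeRule]) -> dict:
    """Compare each attribute score against its threshold.

    `scores` stands in for the output of a trained scorer; here the values
    are supplied by hand. A missing attribute raises KeyError rather than
    being silently treated as safe.
    """
    violated = [rule.name for rule in rules if scores[rule.name] >= rule.threshold]
    return {"allow": not violated, "violated": violated}


# Illustrative rules and scores (made-up values, not results from the paper).
rules = [
    AttributeRule("age_appropriateness", 0.5),
    AttributeRule("vegan_dietary", 0.5),
]
decision = moderate({"age_appropriateness": 0.1, "vegan_dietary": 0.8}, rules)
print(decision)
```

Keeping one threshold per attribute means a deployment could tighten or relax a single attribute without retraining, which is one plausible reading of the "fine-grained" control the abstract describes.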
