Safety Pretraining: Toward the Next Generation of Safe AI

Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, J. Zico Kolter

TL;DR

This work tackles the brittleness of post-hoc safety alignment in large language models by proposing a data-centric pretraining framework that embeds safety from the start. It combines safety filtering, synthetic recontextualization, native refusal training (RefuseWeb and Moral Education data), and Harmfulness-Tag annotated pretraining, plus Safe Beam Search for inference-time steering and new evaluation tools. Empirically, safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard benchmarks while maintaining performance on standard tasks, indicating robust safety without sacrificing utility. The SafeLM family and the accompanying Safe Playground aim to democratize safety research, establishing a foundation for inherently safer AI systems and a practical, scalable approach to safety that persists under benign finetuning and adversarial challenges.

Abstract

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier to classify web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
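The four steps above can be read as a per-document routing decision followed by optional tagging. The following is a minimal sketch under stated assumptions, not the authors' implementation: the 0 to 5 safety-score scale and the use of score-4 and score-5 documents as refusal sources come from the figure captions, but the cutoff for rephrasing, the tag string `<|harmful|>`, prefix placement of the tag, and the names `Doc`, `route`, and `add_harm_tag` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tag string; the source only says "a special token".
HARM_TAG = "<|harmful|>"


@dataclass(frozen=True)
class Doc:
    text: str
    safety_score: int  # assumed scale: 0 (safest) to 5 (most unsafe)


def route(doc: Doc) -> str:
    """Pick a data-centric intervention for one document.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0 <= doc.safety_score <= 5:
        raise ValueError(f"safety_score out of range: {doc.safety_score}")
    if doc.safety_score == 0:
        return "keep"            # safe web data is used as-is
    if doc.safety_score >= 4:
        return "refusal_source"  # highly unsafe text seeds refusal-style data
    return "rephrase"            # remaining unsafe text is recontextualized


def add_harm_tag(doc: Doc, threshold: int = 1) -> str:
    """Prefix unsafe text with the tag (prefix placement is an assumption)."""
    if doc.safety_score >= threshold:
        return f"{HARM_TAG} {doc.text}"
    return doc.text


if __name__ == "__main__":
    corpus = [
        Doc("A recipe for sourdough bread.", 0),
        Doc("A forum post describing a risky stunt.", 2),
        Doc("Detailed instructions for causing harm.", 5),
    ]
    for doc in corpus:
        print(route(doc), "|", add_harm_tag(doc))
```

In this sketch the safety score would come from the safety classifier of step (i); at inference, a decoder could then penalize or discard continuations in which the model is likely to emit the tag, which is the steering role the abstract attributes to the special token.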

Paper Structure

This paper contains 74 sections, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: Data Safety Report Card. Our newly proposed standardized report card on the safety of pretraining datasets. We report (i) the distribution over safety scores on a sample of our pretraining data and (ii) the frequencies of content (per 1 million tokens) from the MLCommons Safety Taxonomy (vidgen2024introducing) in different components of our pretraining mixture.
  • Figure 2: Safety Pretraining Yields Natively Aligned and Robust Models Against Attacks. We compare attack success rates (ASR) across three settings: base model evaluations for safety, post instruction (safety)-tuning, and post benign finetuning attacks. Each stage includes evaluations for three variants: standard pretraining, safety pretraining, and safety pretraining with SafeBeam decoding. (1) Safety pretraining produces inherently safer base models, as reflected by the substantially lower ASR in the safety-pretrained base models (leftmost section). (2) While surface-level alignment via safety instruction tuning may initially reduce ASR (middle section), its brittleness becomes apparent in the sharp increase in ASR after even a small amount of benign finetuning (e.g., on GSM8k here). In contrast, our safety-pretrained models are much more robust and exhibit a much smaller increase in ASR under benign finetuning.
  • Figure 3: Safety Pretraining maintains or improves helpfulness on benign requests. We compare overrefusal behavior on Alpaca (taori2023stanford). We observe that Safety Pretraining leads to no drop in compliance rate on benign requests. Adding SafeBeam during inference leads to a slight increase in overrefusal rate.
  • Figure 4: Ablating Importance of Data-Centric Interventions. We evaluate the impact of progressively richer data-centric interventions on safety, measured by Attack Success Rate (ASR) after benign finetuning on the GSM8k dataset. Recent works (qi2024safety, betley2025emergent) have highlighted the brittleness of safety alignment, especially under benign finetuning. Our safety-pretrained models are natively safe and maintain low ASR even after benign finetuning. Interestingly, training exclusively on the safest subset (score-0 only) leads to higher ASR than training on the whole dataset, likely due to lack of exposure to unsafe patterns. In contrast, incorporating rephrased unsafe content (which contextualizes it in a safe and sensitive fashion) substantially reduces ASR. Further gains are achieved by adding refusal-style completions sourced from highly unsafe content (score-4 and score-5 data), with the greatest improvement observed when moral education data is included. These results underscore the need for both contextual exposure and ethically aligned supervision during pretraining to build safer models.
  • Figure 5: Confusion matrices of different safety scoring approaches. The LLM-based approach and our ensembling strategy lead to more stringent filters than using embedding-based classifiers, at the cost of over-predicting instances with actual score 0.
  • ...and 2 more figures
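
Several captions above report Attack Success Rate (ASR). As a rough illustration, ASR can be treated as the fraction of adversarial prompts whose responses are judged harmful. The sketch below assumes that reading; how a response gets judged harmful (human rater, classifier, or LLM judge) is not specified here, and the function names are hypothetical. The only figures taken from the source are the headline 38.8% and 8.4%.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of attack prompts whose response was judged harmful.

    Each entry is one attack prompt; True means the attack succeeded.
    """
    if len(judged_harmful) == 0:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Toy judgments (made up for illustration): 1 of 4 attacks succeeded.
    print(attack_success_rate([True, False, False, False]))  # 0.25

    # Headline numbers from the abstract: 38.8% ASR down to 8.4%.
    print(f"{relative_reduction(38.8, 8.4):.1%}")  # 78.4%
```

Under this definition, the drop from 38.8% to 8.4% is a 30.4 percentage-point absolute reduction, or roughly a 78% relative reduction.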