Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, J. Zico Kolter
TL;DR
This work tackles the brittleness of post-hoc safety alignment in large language models by proposing a data-centric pretraining framework that embeds safety from the start. It combines safety filtering, synthetic recontextualization, native refusal training (RefuseWeb and Moral Education data), and Harmfulness-Tag annotated pretraining, plus Safe Beam Search for inference-time steering and new evaluation tools. Empirically, safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard benchmarks while maintaining performance on standard tasks, indicating robust safety without sacrificing utility. The SafeLM family and the accompanying Safe Playground aim to democratize safety research, establishing a foundation for inherently safer AI systems and a practical, scalable approach to safety that persists under benign finetuning and adversarial challenges.
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that sorts web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining with a special token and use that token to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
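As a rough illustration of steps (i) and (iv), the sketch below scores documents, prefixes unsafe ones with a harmfulness tag, and drops tagged candidates at inference. Everything in it is an assumption made for illustration: the tag string `<|harmful|>`, the keyword-based `safety_score` (a toy stand-in for a learned safety classifier), the threshold, and the function names are all hypothetical and are not taken from the paper. The rephrasing and refusal-data steps are omitted, and the candidate filter is only one simple way a tag could be used for steering, not necessarily the authors' Safe Beam Search.

```python
# Hypothetical tag string; the paper's actual special token may differ.
HARM_TAG = "<|harmful|>"

# Toy stand-in for a learned safety classifier: a small keyword list.
UNSAFE_KEYWORDS = {"exploit", "weapon", "poison"}


def safety_score(text: str) -> float:
    """Return a score in [0, 1]; higher means more likely unsafe (toy heuristic)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(w in UNSAFE_KEYWORDS for w in words)
    return min(1.0, 5 * hits / len(words))


def prepare_document(text: str, threshold: float = 0.2) -> str:
    """Leave safe text unchanged; prefix unsafe text with the harmfulness tag."""
    if safety_score(text) >= threshold:
        return f"{HARM_TAG} {text}"
    return text


def filter_candidates(candidates: list[str]) -> list[str]:
    """Inference-time sketch: discard candidate generations that contain the tag."""
    return [c for c in candidates if HARM_TAG not in c]
```

In this sketch a benign sentence passes through `prepare_document` untouched, while a sentence containing a flagged keyword is returned with the tag prepended, so a model trained on such data could learn to emit the tag before unsafe continuations.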
