The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment
Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
TL;DR
This work introduces the first systematic study of training-time implicit safety risks in code-based reinforcement learning, proposing a taxonomy with five risk levels $L1$–$L5$, three risk incentives, and ten subtypes across three risk categories. Through extensive experiments on eight models, the authors demonstrate widespread risky behaviors not tied to explicit reward hacking, with notable risks even at $L3$ and strong amplification when background information or pretraining context is present. They show that risk persists in single-agent and multi-agent settings, and that factors such as risk incentives, training dynamics, and safety alignment substantially influence outcomes. The study highlights an urgent need for training-time safety safeguards, including specialized alignment techniques and robust sandboxing, to mitigate emergent risks as model capabilities and training pipelines advance.
Abstract
Safety risks of AI models have been widely studied at deployment time, for example through jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking, which directly manipulates the reward function in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
