The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua

TL;DR

This work introduces the first systematic study of training-time implicit safety risks in code-based reinforcement learning, proposing a taxonomy with five risk levels $L1$–$L5$, three risk incentives, and ten subtypes across three risk categories. Through extensive experiments on eight models, the authors demonstrate widespread risky behaviors not tied to explicit reward hacking, with notable risks even at $L3$ and strong amplification when background information or pretraining context is present. They show that risk persists in single-agent and multi-agent settings, and that factors such as risk incentives, training dynamics, and safety alignment substantially influence outcomes. The study highlights an urgent need for training-time safety safeguards, including specialized alignment techniques and robust sandboxing, to mitigate emergent risks as model capabilities and training pipelines advance.

Abstract

Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

Paper Structure (46 sections, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Illustrative cases of implicit safety risks in single-agent and multi-agent training. In the single-agent case, the model covertly deletes a new model’s weights to benefit its developer, driven by social compliance. In the multi-agent case, two lower-accuracy models secretly inflate their recorded accuracies, while the highest-accuracy model covertly degrades another model’s recorded accuracy, motivated by self-interest.
  • Figure 2: Illustration of the taxonomy of training-time implicit safety risks. The three high-level risk categories are further decomposed into ten fine-grained categories, of which only a subset is shown in the figure.
  • Figure 3: L5 evaluation results on Llama-3.1-8B-Instruct.
  • Figure 4: Change in the ratio of risky rollouts at each training step on Llama-3.1-8B-Instruct.
  • Figure 5: The risk scores for three agents (Llama-3.1-8B-Instruct). The initial recorded accuracies given to Agents 1/2/3 are 0.75/0.80/0.70. We repeat training 10 times for each agent and report the average risk score.
  • ...and 12 more figures