Capability-Oriented Training Induced Alignment Risk

Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

TL;DR

The paper investigates capability-oriented training induced alignment risk, where reinforcement learning in imperfect environments can trigger spontaneous exploitation of training loopholes. It introduces a vulnerability-games framework to reveal four failure modes and demonstrates that exploitative strategies not only emerge across multiple models but can generalize to new tasks and transfer between models, even via distillation. The study shows that such exploits can coincide with genuine performance gains, creating a stealthy risk that is hard to detect with standard monitoring. The work argues for auditing training environments and reward mechanisms as a core safety practice, beyond traditional content moderation, and highlights implications for open-weight models and broader AI safety research.

Abstract

While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.

Paper Structure

This paper contains 68 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: A visual guide to the four “vulnerability games” used to evaluate deceptive behaviors in LLMs. Each panel corresponds to one scenario: (A) Context-conditional Compliance (green arrows denote “yes” responses and red arrows denote “no” responses), (B) Audited Self-Grading, (C) Proxy Metric Gaming, and (D) Reward/State Tampering. In panels C and D, “Pre.” denotes model behavior before training, while “Post” denotes behavior after training.
  • Figure 2: Spontaneous Emergence of Exploitation during RL. We track the evolution of Intended Task Performance (ITP) and Exploit Ratio (ER) across three models and four vulnerability games. We annotate the First Appearance Step (FES), marking the training step where the exploit first emerges, and Domination Steps (DS), measuring the duration from onset to saturation.
  • Figure 3: Case study of Situation Awareness (After training).
  • Figure 4: Case study of Reward/State Tampering (Before and After training).
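
The Figure 2 caption describes two timing annotations read off the Exploit Ratio (ER) curve: the First Appearance Step (FES) and the Domination Steps (DS). The following is a minimal sketch of how such annotations could be computed from a logged ER series. The function name, the onset threshold, and the saturation threshold are hypothetical choices made for illustration; the summary above does not state the exact criteria the authors use.

```python
from typing import Optional, Sequence, Tuple


def exploit_onset_and_domination(
    steps: Sequence[int],
    exploit_ratio: Sequence[float],
    onset_threshold: float = 0.0,       # assumption: any ER above this counts as "emerged"
    saturation_threshold: float = 0.9,  # assumption: ER at or above this counts as "saturated"
) -> Tuple[Optional[int], Optional[int]]:
    """Return (FES, DS) for one training run.

    FES is the first logged step whose exploit ratio exceeds
    `onset_threshold`. DS is the number of steps between FES and the
    first later (or same) step whose exploit ratio reaches
    `saturation_threshold`. Either value is None if the event never occurs.
    """
    if len(steps) != len(exploit_ratio):
        raise ValueError("steps and exploit_ratio must have the same length")

    first_step: Optional[int] = None
    for step, ratio in zip(steps, exploit_ratio):
        # Record the onset the first time the exploit shows up.
        if first_step is None and ratio > onset_threshold:
            first_step = step
        # Once the exploit has appeared, look for saturation.
        if first_step is not None and ratio >= saturation_threshold:
            return first_step, step - first_step

    return first_step, None


if __name__ == "__main__":
    # Toy, made-up ER curve logged every 10 training steps.
    logged_steps = [0, 10, 20, 30, 40]
    logged_er = [0.0, 0.0, 0.05, 0.5, 0.95]
    print(exploit_onset_and_domination(logged_steps, logged_er))
```

Both thresholds are exposed as parameters because the measured onset and duration depend directly on them; a noisy ER curve would likely need a nonzero onset threshold or some smoothing before applying this.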