Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah Marlowe

TL;DR

The paper tackles the enduring safety-capability trade-off in fine-tuned LLMs by analyzing Reinforcement Learning with Verifiable Rewards (RLVR) under KL constraints. It develops a theoretical framework showing that RLVR reweights the path distribution via a Gibbs tilt, yielding explicit optimality conditions and a bound on safety drift that can be controlled by the chi-squared divergence between policies. Empirically, RLVR models trained on mathematics and coding tasks demonstrate improved reasoning while preserving or enhancing safety guardrails across five adversarial safety benchmarks, with ablations showing that this holds across algorithm choice, model size, and task domain. The results challenge the assumption of an inevitable safety-capability trade-off and offer a principled approach for deploying reasoning-capable LLMs with strong safety guarantees, guided by verifiable rewards and KL regularization.
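To make the "Gibbs tilt" concrete, here is a minimal numerical sketch, not the paper's implementation. It assumes the standard closed form for KL-regularized reward maximization, in which the tuned policy is proportional to the reference policy times `exp(reward / beta)`. The four candidate responses, their probabilities, the `rewards` and `safe` arrays, and the names `gibbs_tilt` and `beta` are all hypothetical values chosen for illustration.

```python
import numpy as np


def gibbs_tilt(ref_probs, rewards, beta):
    """Reweight a reference distribution: pi(y) proportional to ref(y) * exp(r(y) / beta)."""
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical toy setup: four candidate responses to a single prompt.
ref_probs = np.array([0.4, 0.3, 0.2, 0.1])  # reference policy probabilities
rewards = np.array([1.0, 0.0, 1.0, 0.0])    # verifiable reward (1 = correct)
safe = np.array([1.0, 1.0, 0.0, 1.0])       # safety indicator (1 = safe)
beta = 1.0                                  # KL regularization strength (assumed)

tilted = gibbs_tilt(ref_probs, rewards, beta)

# Expected success and safety rates before and after the tilt.
success_ref, success_new = ref_probs @ rewards, tilted @ rewards
safety_ref, safety_new = ref_probs @ safe, tilted @ safe

# Divergences between the tilted policy and the reference policy.
kl = np.sum(tilted * np.log(tilted / ref_probs))
chi2 = np.sum((tilted - ref_probs) ** 2 / ref_probs)

# Cauchy-Schwarz bound on the safety drift in terms of the chi-squared divergence.
var_safe = ref_probs @ (safe - safety_ref) ** 2
drift_bound = np.sqrt(chi2 * var_safe)

print(f"success rate: {success_ref:.3f} -> {success_new:.3f}")
print(f"safety rate:  {safety_ref:.3f} -> {safety_new:.3f}")
print(f"KL = {kl:.4f}, chi2 = {chi2:.4f}, drift bound = {drift_bound:.4f}")
```

In this toy, a binary reward that is uncorrelated with the safety indicator under the reference distribution would leave the safety rate unchanged by the tilt; the example above deliberately correlates the two so that a nonzero drift is visible and can be compared against the bound.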

Abstract

Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
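For intuition about how a safety-drift bound can depend on the chi-squared divergence between policies, the following is a standard Cauchy-Schwarz argument, offered as a sketch and not necessarily the form of the paper's theorem. The symbols are assumptions introduced here: $s(\boldsymbol{y}) \in [0,1]$ is a safety score for response $\boldsymbol{y}$, $\pi_{\mathrm{ref}}$ is the reference policy, and $w$ is the likelihood ratio between the tuned and reference policies (assumed finite wherever $\pi_{\mathrm{ref}}$ has mass).

```latex
\begin{align*}
\mathbb{E}_{\pi}[s] - \mathbb{E}_{\pi_{\mathrm{ref}}}[s]
  &= \mathbb{E}_{\pi_{\mathrm{ref}}}\!\big[(w-1)\,\big(s-\mathbb{E}_{\pi_{\mathrm{ref}}}[s]\big)\big],
  \qquad
  w(\boldsymbol{y}) = \frac{\pi(\boldsymbol{y}\mid\boldsymbol{x})}{\pi_{\mathrm{ref}}(\boldsymbol{y}\mid\boldsymbol{x})}, \\
\big|\mathbb{E}_{\pi}[s] - \mathbb{E}_{\pi_{\mathrm{ref}}}[s]\big|
  &\le \sqrt{\chi^2\!\big(\pi\,\|\,\pi_{\mathrm{ref}}\big)\,\mathrm{Var}_{\pi_{\mathrm{ref}}}(s)}
  \;\le\; \tfrac{1}{2}\sqrt{\chi^2\!\big(\pi\,\|\,\pi_{\mathrm{ref}}\big)}.
\end{align*}
```

The first line uses $\mathbb{E}_{\pi_{\mathrm{ref}}}[w-1]=0$, the second applies Cauchy-Schwarz with $\chi^2(\pi\,\|\,\pi_{\mathrm{ref}}) = \mathbb{E}_{\pi_{\mathrm{ref}}}[(w-1)^2]$, and the final step uses the fact that a $[0,1]$-valued score has variance at most $1/4$.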

Paper Structure

This paper contains 39 sections, 7 theorems, 31 equations, 6 figures, 5 tables.

Key Result

Corollary 1

The expected success and safety rates under model $\pi$ for input $\boldsymbol{x}$ are:

Figures (6)

  • Figure 1: Paired differences of harmfulness scores between the fine-tuned model and its corresponding base model. The SFT fine-tuned model's paired differences are shown in dashed lines. While the RLVR-trained model exhibits paired differences centered around zero with low variability (shaded region), the SFT-trained model demonstrates consistently higher paired difference scores.
  • Figure 2: Paired differences of harmfulness scores between the fine-tuned model and its corresponding base model. The shaded region denotes the overall mean and standard deviations.
  • Figure 3: Paired differences of harmfulness scores between the fine-tuned model and its corresponding base model. The shaded region denotes the overall mean and standard deviations.
  • Figure 4: Paired differences of harmfulness scores between the fine-tuned model and its corresponding base model. The shaded region denotes the overall mean and standard deviations.
  • Figure 5: Harmfulness scores across different temperature settings.
  • ...and 1 more figure

Theorems & Definitions (12)

  • Corollary 1: Success and Safety Rate under $\pi$
  • proof
  • Theorem 1: Optimal Policy
  • Theorem 2: Safety Drift Upper Bound
  • Proposition 1: Safety invariance
  • proof
  • Proposition 2: Worst Case Upper Bound
  • proof
  • Theorem 1: Optimal Policy
  • proof
  • ...and 2 more