Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Weitao Feng, Lixu Wang, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Peizhuo Lv, Wei Dong

TL;DR

Harmful reinforcement learning fine-tuning (Harmful-RL) poses a more serious and more robust threat to safety-aligned LLMs than supervised fine-tuning. The authors introduce TokenBuncher, a defense that combines entropy suppression with a Token Noiser to block RL-driven harmful optimization while preserving benign performance and finetunability. TokenBuncher uses the negative entropy of rollouts as an online reward and pairs it with a noise mechanism designed to keep harmful capabilities from escalating; the resulting defense generalizes to unseen harmful queries and withstands adaptive attacks across multiple models and RL algorithms. Empirical results show substantial reductions in harmfulness at minimal cost to benign capabilities, highlighting the need for defenses against RL-based threats that address safety and functional robustness together.

Abstract

As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
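To make the entropy-as-reward idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a rollout is represented as a list of per-token logit vectors and that the reward is the negative mean token entropy; the function names (`softmax`, `token_entropy`, `negative_entropy_reward`) and the averaging over tokens are illustrative assumptions, not details given in the abstract.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    # Shannon entropy (in nats) of the softmax distribution over one token position.
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def negative_entropy_reward(rollout_logits):
    # ASSUMPTION: reward = negative mean per-token entropy of the rollout.
    # A more peaked (lower-entropy) response earns a higher reward.
    entropies = [token_entropy(step) for step in rollout_logits]
    return -sum(entropies) / len(entropies)

# Toy rollouts over a 4-token vocabulary (hypothetical values).
peaked = [[10.0, 0.0, 0.0, 0.0]] * 3   # nearly deterministic at every step
flat = [[0.0, 0.0, 0.0, 0.0]] * 3      # uniform at every step

print(negative_entropy_reward(peaked) > negative_entropy_reward(flat))
```

Under this reward, optimization pushes responses toward low-entropy, near-deterministic distributions, which is the sense in which constraining entropy leaves a later RL attack with few distinct rollouts, and hence few distinct reward signals, to exploit.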

Paper Structure

This paper contains 46 sections, 1 theorem, 20 equations, 12 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Consider a policy $\pi_\theta(\bm{y} \mid \bm{q})$ parameterized by a softmax function over logits $z_\theta(\cdot \mid \bm{q})$. Let the optimization objective be defined as: Assume: Let $\bar{H}(\pi_\theta)=\mathbb E_{\bm q\sim\mathcal{D}}[H(\pi_\theta(\cdot\mid \bm q))]$. Then there exists a constant $C>0$ such that

Figures (12)

  • Figure 1: Entropy and accuracy with RL training. Left: Entropy distributions of model outputs on GSM8K test samples decrease with RL. Right: Corresponding model accuracy on the GSM8K test set increases with more RL training steps.
  • Figure 2: Reward-model score distribution during the first 10 Harmful-RL steps under GRPO training for Qwen2.5-3B-Instruct. Higher reward indicates more harmful outputs.
  • Figure 3: Overview of our TokenBuncher framework. (a) Training pipeline: We employ interleaved training on mixed data. For benign queries, the model uses the KL divergence as a reward. For harmful queries, the model is optimized using negative entropy as a reward, while a Token Noiser is applied to the probability distribution for joint CrossEntropy optimization. (b) Effect of the Token Noiser against Harmful-RL attacks. Without noise, boosting entropy redistributes probability mass to harmful tokens. With noise, the same attack amplifies the injected randomness, producing incoherent gibberish.
  • Figure 4: Visualization of token probabilities in log scale. Left: Probability distribution of the first 10 tokens in the vocabulary subset when the DEM model responds to a harmful query. Right: Low-probability tokens gain higher probabilities overall after a few steps of Harmful-RL.
  • Figure 5: Accuracy curves for benign task fine-tuning. Left: accuracy during SFT on the Countdown task. Right: accuracy during RL fine-tuning on GSM8K.
  • ...and 7 more figures
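
The reward routing described in the Figure 3(a) caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the caption says only that benign queries use "the KL divergence as a reward", so the sign (negative KL) and the comparison against a frozen reference distribution are assumptions here, as are the names `defense_reward`, `policy_probs`, and `reference_probs`. The Token Noiser and the joint cross-entropy term are omitted because the caption does not specify them.

```python
import math

def entropy(p):
    # Shannon entropy (in nats) of a probability vector.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    # KL(p || q); assumes q[i] > 0 wherever p[i] > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def defense_reward(policy_probs, reference_probs, is_harmful):
    if is_harmful:
        # Harmful query: negative entropy as reward (per the caption).
        return -entropy(policy_probs)
    # Benign query. ASSUMPTION: the reward is the negative KL divergence from
    # a frozen reference distribution, so drifting from it is penalized.
    return -kl_divergence(policy_probs, reference_probs)

# Toy distributions over a 4-token vocabulary (hypothetical values).
reference = [0.7, 0.1, 0.1, 0.1]
drifted = [0.4, 0.2, 0.2, 0.2]

print(defense_reward(reference, reference, is_harmful=False))  # no drift, no penalty
print(defense_reward(drifted, reference, is_harmful=False))    # drift is penalized
print(defense_reward(drifted, reference, is_harmful=True))     # entropy is penalized
```

In an interleaved training loop of the kind the caption describes, each batch item would be routed through one branch or the other depending on whether its query is labeled harmful or benign.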

Theorems & Definitions (1)

  • Theorem 1