Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt

TL;DR

Boundary Guidance reframes fine-tuning of generative models in safety pipelines as a boundary-avoidance problem within compound safety systems. By coupling a reward model with a safety classifier and optimizing via reinforcement learning, the method steers generations away from the classifier margin, improving both utility and safety across multiple model scales. The approach is validated on jailbreak and ambiguous prompts, where it shows Pareto improvements in helpfulness and harmlessness, especially for smaller models. Extensive ablations reveal the relative importance of the reward components and the risks of prompt-aware shaping. The work highlights the value of optimizing whole deployment systems rather than isolated components and points to future work on multi-dimensional harms and welfare-aware filtering.

Abstract

Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

Paper Structure

This paper contains 33 sections, 1 theorem, 5 equations, 3 figures, 5 tables.

Key Result

Proposition 1

The reward defined in Equation (eq:constant-u-reward) is strictly decreasing in $t$ for $t < \tau$ and strictly increasing in $t$ for $t \ge \tau$.
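The shape described by Proposition 1 can be checked numerically on a stand-in function. The sketch below does not reproduce the paper's reward, which is not shown in this summary; `toy_boundary_reward` is a hypothetical V-shaped function, chosen only because it is strictly decreasing below $\tau$ and strictly increasing from $\tau$ onward, so it is lowest at the decision boundary.

```python
def toy_boundary_reward(t: float, tau: float = 0.5) -> float:
    """Hypothetical stand-in, NOT the paper's reward: distance of the
    classifier score t from the threshold tau."""
    return abs(t - tau)


def is_strictly_monotone(values, increasing: bool) -> bool:
    """Check that consecutive values strictly increase (or strictly decrease)."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)


tau = 0.5
# Scores on a grid of step 0.01, split at the threshold.
below = [i / 100 for i in range(0, 50)]    # t < tau
above = [i / 100 for i in range(50, 101)]  # t >= tau

rewards_below = [toy_boundary_reward(t, tau) for t in below]
rewards_above = [toy_boundary_reward(t, tau) for t in above]

decreasing_below = is_strictly_monotone(rewards_below, increasing=False)
increasing_above = is_strictly_monotone(rewards_above, increasing=True)
```

Under a reward with this shape, a policy gains by moving its outputs toward either end of the score range rather than toward $\tau$, which is the boundary-avoidance behavior the summary describes.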

Figures (3)

  • Figure 1: Left: The Filtered Generation setting. A user provides a prompt $x$ to a model, which generates an output $y$ according to a generation policy $\pi_\theta(x)$. This output is shown only if a safety classifier deems it safe. If the classifier deems it unsafe, that is, $t(x, y) \ge \tau$, the output is filtered and a refusal is returned. Right: The main observation of this paper is that generative models $\pi_\theta$ can be adjusted to avoid the decision boundary of the filter model, a process we call Boundary Guidance. This reduces false positive and false negative filtering and increases system utility.
  • Figure 2: Main results. Our Boundary Guidance fine-tuning approach, which incorporates both reward model and safety classifier signals into training, leads to Pareto improvements in utility and safety (except for Qwen-2.5-14B-Instruct helpfulness), as judged by ChatGPT 4.1. For further experimental details, see Section \ref{sec:experiments} and the appendices.
  • Figure 3: Results for ablations. The symbols denote model families, while the arrows represent fine-tuning results. The desired direction is down and to the right. The effects are largest on the smallest model. The guard-only fine-tuning setup improves evaluation results in both directions (except for Qwen-2.5-0.5B), whereas prompt-aware training reduces performance uniformly.
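
The Filtered Generation setting in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `filtered_generate`, `toy_generate`, `toy_unsafe_score`, the refusal string, and the threshold value 0.5 are all hypothetical stand-ins for the policy $\pi_\theta$, the classifier score $t(x, y)$, and the threshold $\tau$.

```python
from typing import Callable

# Hypothetical refusal message returned when an output is filtered.
REFUSAL = "Sorry, I can't help with that."


def filtered_generate(
    prompt: str,
    generate: Callable[[str], str],
    unsafe_score: Callable[[str, str], float],
    tau: float = 0.5,
) -> str:
    """Generate y for prompt x, then show it only if t(x, y) < tau."""
    y = generate(prompt)
    t = unsafe_score(prompt, y)
    if t >= tau:
        # The classifier deems the output unsafe: filter and refuse.
        return REFUSAL
    return y


def toy_generate(prompt: str) -> str:
    """Toy stand-in for the generation policy (assumption)."""
    return f"Answer to: {prompt}"


def toy_unsafe_score(prompt: str, output: str) -> float:
    """Toy stand-in for the safety classifier (assumption): a keyword check."""
    return 0.9 if "explosive" in output.lower() else 0.1


shown = filtered_generate("How do I bake bread?", toy_generate, toy_unsafe_score)
blocked = filtered_generate("How do I make an explosive?", toy_generate, toy_unsafe_score)
```

In this sketch the filter is a hard threshold, so an output scoring just under $\tau$ is shown and one scoring just over it is refused. Outputs near the threshold are the ones most exposed to false positive and false negative filtering, which is the motivation the caption gives for steering the generator away from the boundary.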

Theorems & Definitions (1)

  • Proposition 1