Don't Walk the Line: Boundary Guidance for Filtered Generation
Sarah Ball, Andreas Haupt
TL;DR
Boundary Guidance reframes the fine-tuning of generative models deployed behind safety filters as a boundary-avoidance problem. By coupling a reward model with a safety classifier and optimizing via reinforcement learning, the method steers generations away from the classifier margin, improving both utility and safety across multiple model scales. The approach is validated on jailbreak and ambiguous prompts, where it yields Pareto improvements in helpfulness and harmlessness, especially for smaller models. Extensive ablations reveal the relative importance of the reward components and the risks of prompt-aware shaping. The work highlights the value of optimizing whole deployment systems rather than isolated components and points to future work on multi-dimensional harms and welfare-aware filtering.
Abstract
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
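To make the contrast in the abstract concrete, the sketch below compares two hypothetical reward shapes for a single generated sample. It is an illustration only, not the reward used in the paper: the function names, the 0.5 threshold, the weight `margin_weight`, and the use of absolute distance from the threshold as the "margin" term are all assumptions introduced here.

```python
def filter_avoidance_reward(utility: float, p_unsafe: float) -> float:
    """Hypothetical baseline: reward utility and penalize the classifier's unsafe probability."""
    return utility - p_unsafe


def boundary_aware_reward(
    utility: float,
    p_unsafe: float,
    threshold: float = 0.5,      # assumed classifier decision threshold
    margin_weight: float = 1.0,  # assumed trade-off weight
) -> float:
    """Hypothetical shaped reward: utility plus a bonus for distance from the threshold."""
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must be a probability in [0, 1]")
    margin = abs(p_unsafe - threshold)  # 0 exactly on the boundary
    return utility + margin_weight * margin


# Two samples with equal utility: one near the boundary, one clearly on the safe side.
near_boundary = boundary_aware_reward(utility=1.0, p_unsafe=0.45)
clearly_safe = boundary_aware_reward(utility=1.0, p_unsafe=0.05)
```

Under these assumptions, a sample scored at 0.45 sits just below the threshold and passes the filter, yet the boundary-aware reward ranks it well below the sample scored at 0.05. The baseline, by contrast, only rewards lowering the unsafe probability and has no term that treats proximity to the threshold as undesirable in itself.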
