Backtracking Improves Generation Safety

Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith

TL;DR

This work introduces backtracking for text generation, which uses a [RESET] token to discard unsafe prefixes and regenerate safe outputs. Trained with SFT and DPO, the approach improves safety on Gemma-2-2B and Llama-3-8B without sacrificing helpfulness, and shows resilience against multiple jailbreak attacks, including an adaptive one. The method enables a tunable safety-efficiency tradeoff via an inference-time logit bias on [RESET], and highlights areas for future improvement, particularly against adaptive adversaries. Overall, backtracking shifts safety from purely preventive to recoverable generation, offering a complementary tool for safer frontier-model deployments.

Abstract

Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1\% $\to$ 1.5\% unsafe responses) in our evaluations, without regression in helpfulness. Our method additionally provides protection against four adversarial attacks, including an adaptive attack, despite not being trained to do so.
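
The inference-time behavior described above (tokens generated before [RESET] are discarded) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `apply_backtracking` is hypothetical, tokens are represented as plain strings for readability, and the handling of more than one [RESET] in a single generation (keeping only what follows the last one) is an assumption made here.

```python
RESET = "[RESET]"  # special token named in the abstract


def apply_backtracking(tokens, reset_token=RESET):
    """Return the tokens that would be shown to the user.

    Everything up to and including the last reset token is dropped.
    If no reset token was generated, the sequence is returned unchanged.
    Using the *last* occurrence is an assumption of this sketch.
    """
    tokens = list(tokens)
    if reset_token not in tokens:
        return tokens
    # Index of the last occurrence of the reset token.
    last = len(tokens) - 1 - tokens[::-1].index(reset_token)
    return tokens[last + 1:]


# Usage: a partial unsafe draft is discarded, the regenerated reply is kept.
draft = ["Sure,", "here", "is", RESET, "I", "can't", "help", "with", "that."]
visible = apply_backtracking(draft)
```

The model itself still spends compute on the discarded prefix, which is why backtracking carries a latency cost that has to be weighed against the safety gain.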

Paper Structure (21 sections, 2 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: Method overview. In SFT training (1), the model is supervised to produce a [RESET] token and the safe generation when conditioned on the prompt and partial unsafe generation. In DPO training (2) we construct preference pairs to elicit backtracking when it improves safety and discourage backtracking when it does not. During inference (3), generated tokens before [RESET] are discarded.
  • Figure 2: Comparison of generations from baseline and backtracking Llama-3-8B models.
  • Figure 3: Overall safety, latency (time to first token) and throughput (tokens per second) for backtracking and baseline Llama models with varying logit biases applied to [RESET].
  • Figure 4: Backtracking provides safety gains under sampling. The figures show % of unsafe responses under best-of-$k$ (left) and worst-of-$k$ (right) sampling with a temperature of 1.
  • Figure 5: Reset rate, latency (time to first token) and throughput (tokens per second) for backtracking and baseline Llama models with varying logit biases applied to [RESET]. Generations are sampled from prompts in the validation set of the OpenAssistant dataset.
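
Figures 3 and 5 vary a logit bias applied to [RESET]. As a rough illustration of what such a bias does at a single decoding step, the sketch below adds a constant to the [RESET] logit before the softmax. The toy vocabulary, the logit values, and the function names are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict mapping token -> logit."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def reset_probability(logits, bias, reset_token="[RESET]"):
    """Probability of the reset token after adding `bias` to its logit."""
    biased = dict(logits)  # copy so the caller's logits are not mutated
    biased[reset_token] += bias
    return softmax(biased)[reset_token]


# Hypothetical next-token logits at one decoding step.
toy_logits = {"Sure": 2.0, "I": 1.0, "[RESET]": 0.0}

for bias in (-5.0, 0.0, 5.0):
    p = reset_probability(toy_logits, bias)
    print(f"bias={bias:+.1f}  P([RESET])={p:.4f}")
```

A positive bias makes [RESET] more likely at every step and a negative bias makes it less likely, so a single scalar controls how readily the model backtracks.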