Backtracking for Safety

Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, Siddhartha Reddy Jonnalagadda

TL;DR

This paper proposes a novel backtracking method that lets the model revert to a safer generation state, not necessarily the beginning, when a safety violation occurs during generation. The method dramatically reduces toxicity arising during generation with minimal impact on efficiency.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial. Current safety alignment methods, such as supervised fine-tuning and reinforcement learning-based approaches, can exhibit vulnerabilities to adversarial attacks and often result in shallow safety alignment, primarily focusing on preventing harmful content in the initial tokens of the generated output. While methods like resetting can help recover from unsafe generations by discarding previous tokens and restarting the generation process, they are not well-suited for addressing nuanced safety violations like toxicity that may arise within otherwise benign and lengthy generations. In this paper, we propose a novel backtracking method designed to address these limitations. Our method allows the model to revert to a safer generation state, not necessarily at the beginning, when safety violations occur during generation. This approach enables targeted correction of problematic segments without discarding the entire generated text, thereby preserving efficiency. We demonstrate that our method dramatically reduces toxicity arising during the generation process with minimal impact on efficiency.
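
The contrast between resetting and backtracking can be illustrated with a small toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the blocklist stands in for a real toxicity detector, and `first_violation`, `generate_with_backtracking`, and `toy_sampler` are hypothetical names introduced only for illustration. When a violation is detected, the loop truncates the output back to the start of the offending segment rather than to the beginning, so the benign prefix is kept.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a real toxicity classifier.
BLOCKLIST = {"idiot"}


def first_violation(tokens: Sequence[str]) -> Optional[int]:
    """Return the index of the first flagged token, or None if none is flagged."""
    for i, tok in enumerate(tokens):
        if tok in BLOCKLIST:
            return i
    return None


def generate_with_backtracking(
    next_token: Callable[[Sequence[str], int], Optional[str]],
    max_tokens: int = 20,
    max_backtracks: int = 5,
) -> Tuple[List[str], int]:
    """Generate tokens, reverting to the last safe prefix on each violation."""
    tokens: List[str] = []
    backtracks = 0
    while len(tokens) < max_tokens:
        # The backtrack count is passed in so a retry can sample differently.
        tok = next_token(tokens, backtracks)
        if tok is None:  # sampler signals end of sequence
            break
        tokens.append(tok)
        bad = first_violation(tokens)
        if bad is not None:
            # Backtrack: keep the safe prefix tokens[:bad]; a reset would
            # instead discard everything (tokens.clear()).
            del tokens[bad:]
            backtracks += 1
            if backtracks >= max_backtracks:
                break  # give up and return the safe prefix
    return tokens, backtracks


def toy_sampler(prefix: Sequence[str], attempt: int) -> Optional[str]:
    """Scripted sampler: emits a flagged word on the first attempt only."""
    last = "idiot" if attempt == 0 else "expert"
    script = ["the", "reviewer", "is", "an", last, "."]
    return script[len(prefix)] if len(prefix) < len(script) else None


if __name__ == "__main__":
    print(generate_with_backtracking(toy_sampler))
```

In this toy run only the single offending token is regenerated; the four preceding tokens are reused, which is the efficiency argument the abstract makes against restarting from scratch.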

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of various responses to different types of prefilling attacks. Bold text represents the model's generation, regular text represents the prefilling by the user, and gray tokens are special tokens generated to alter the generation.
  • Figure :
  • Figure :