Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar

TL;DR

This work addresses safety vulnerabilities in large language models by enabling adaptive reasoning with test-time compute. It introduces TARS: Training Adaptive Reasoners for Safety, a three-stage RL framework that lightly supervises with diverse long chain-of-thought traces, curates a mix of harmful, harmless, and ambiguous prompts, and employs a dual reward design to balance safety with task completion. Empirical results show that TARS achieves superior safety–refusal trade-offs, longer and more nuanced reasoning on ambiguous prompts, and stronger robustness to white-box and black-box jailbreak attacks, even at smaller model scales. The approach provides an open recipe for training LLMs against jailbreaks by reasoning per prompt, with implications for scalable and adaptable safety defenses in real-world deployments.

Abstract

Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy-to-verify domains such as math and code. In this work, we study how to use this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a "lightweight" warm-start SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as excessive refusals, and (3) a reward function that prevents degeneration of reasoning capabilities during training. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box (e.g., PAIR) attacks. Overall, our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.
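To make the idea of a reward that balances safety with task completion concrete, the following is a minimal sketch of routing each RL rollout to a reward chosen by its prompt type. It is an illustration under stated assumptions, not the authors' implementation: the category labels, the `dual_reward` function, the toy scorers, and in particular the rule used for ambiguous prompts are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable

# Hypothetical category labels; the abstract names the three prompt types
# but not how they are encoded.
HARMFUL, HARMLESS, AMBIGUOUS = "harmful", "harmless", "ambiguous"

# A scorer maps (prompt, response) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def dual_reward(prompt: str, response: str, prompt_type: str,
                safety: Scorer, helpfulness: Scorer) -> float:
    """Route a rollout to a reward chosen by its prompt type."""
    if prompt_type == HARMFUL:
        # Harmful prompts: reward safe behavior only.
        return safety(prompt, response)
    if prompt_type == HARMLESS:
        # Harmless prompts: reward task completion, so a blanket refusal scores low.
        return helpfulness(prompt, response)
    if prompt_type == AMBIGUOUS:
        # ASSUMPTION: placeholder rule, not taken from the source. The response
        # should be both safe and useful, so take the weaker of the two scores.
        return min(safety(prompt, response), helpfulness(prompt, response))
    raise ValueError(f"unknown prompt type: {prompt_type!r}")


# Toy stand-in scorers (hypothetical); a real setup would use learned judges.
def toy_safety(prompt: str, response: str) -> float:
    return 1.0 if response.startswith("I can't") else 0.5


def toy_helpfulness(prompt: str, response: str) -> float:
    return 0.0 if response.startswith("I can't") else 1.0
```

With these toy scorers, a refusal earns full reward on a harmful prompt and zero on a harmless one, which is the pressure that discourages the shortcut of refusing everything.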

Paper Structure

This paper contains 33 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Training Adaptive Reasoners for Safety (TARS). Our recipe for training adaptive reasoners is built in three stages. The first stage lightly trains the base model through SFT on exploratory reasoning and answer traces for diversity. The second stage gathers harmful, harmless, and ambiguous prompts for RL. The third stage shapes separate reward functions, resulting in improved safety and less refusal when trained with RL.
  • Figure 2: TARS vs. Baselines. Safety-refusal trade-off of models trained with TARS compared to representation re-routing, existing reasoning defenses, and instruction-tuned models. The Defense Success Rate is averaged over all four attacks (GCG, PAIR, AutoDAN, PAP). (a) Comparison to Llama-RR and Mistral-RR retrained without XSTest, with compliance benchmarked on XSTest. (b) Comparison to the originally released Llama-RR and Mistral-RR, with compliance benchmarked on WildChat.
  • Figure 3: TARS vs. SFT/DPO/RL. Pareto frontier showing the safety-refusal trade-off of models trained with TARS and SFT/DPO/RL, averaged over all attacks, with compliance benchmarked on XSTest. On each line, $\lambda=0.1$ marks the model with the highest compliance and $\lambda=0.9$ the model with the lowest.
  • Figure 4: Internal representations. Representations of GCG attack prompts and XSTest "safe" (ambiguous) prompts from the last embedding layer, projected into 2D using UMAP (McInnes et al., 2018). TARS attains the largest margin when fitting a soft SVM, indicating that it internally separates the two types of prompts better for adaptivity.
  • Figure 5: GCG vs. PAIR. Safety-refusal trade-off shown separately for GCG (white-box) and PAIR (black-box), with compliance benchmarked on XSTest. TARS still attains the best safety-refusal trade-off under each attack individually.
  • ...and 5 more figures
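
The margin comparison described for Figure 4 can be illustrated with a small, self-contained sketch. It fits a soft-margin linear SVM to 2D points and reports the margin width $2/\lVert w \rVert$. Everything here is an assumption made for illustration: the toy points stand in for UMAP projections, the trainer is a plain full-batch subgradient descent rather than whatever solver the authors used, and the function names and hyperparameters are hypothetical.

```python
import math


def fit_soft_margin_svm(points, labels, c=1.0, lr=0.001, epochs=5000):
    """Minimize 0.5*||w||^2 + c * sum(hinge loss) by full-batch subgradient descent."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        # Start from the gradient of the regularizer 0.5*||w||^2.
        g0, g1, gb = w0, w1, 0.0
        for (x, y), t in zip(points, labels):
            # Only points inside or beyond the margin contribute hinge subgradients.
            if t * (w0 * x + w1 * y + b) < 1.0:
                g0 -= c * t * x
                g1 -= c * t * y
                gb -= c * t
        w0, w1, b = w0 - lr * g0, w1 - lr * g1, b - lr * gb
    return (w0, w1), b


def margin_width(w):
    """Distance between the two margin hyperplanes of a linear SVM."""
    return 2.0 / math.hypot(w[0], w[1])


# Toy 2D data (hypothetical): two prompt classes, labeled -1 and +1.
labels = [-1, -1, 1, 1]
well_separated = [(-3.0, 0.0), (-3.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
poorly_separated = [(-1.0, 0.0), (-1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

wide = margin_width(fit_soft_margin_svm(well_separated, labels)[0])
narrow = margin_width(fit_soft_margin_svm(poorly_separated, labels)[0])
print(f"well separated margin:   {wide:.2f}")
print(f"poorly separated margin: {narrow:.2f}")
```

A larger margin means the two groups of points sit farther from the decision boundary, which is the sense in which the caption reads a larger margin as better internal separation of the two prompt types.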