Trading Inference-Time Compute for Adversarial Robustness

Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese

TL;DR

The paper investigates whether increasing inference-time compute, without adversarial training, can improve adversarial robustness of reasoning Large Language Models across a variety of attack surfaces. It introduces and evaluates novel attack vectors (soft tokens, Think-Less, Nerd Sniping) and tests robustness on unambiguous versus ambiguous tasks, including math problems, policy violations, and multimodal inputs. Across many settings, longer reasoning time reduces attacker success, particularly for unambiguous tasks, while certain attacks and ambiguous tasks reveal limitations and new vulnerabilities. The work highlights inference-time compute as a practical robustness lever, documents key failure modes, and outlines directions for future research in safety-critical, agentic LLM applications.

Abstract

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
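The central quantity in the abstract, the fraction of model samples on which the attack succeeds as a function of inference-time compute, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' evaluation code: the function name `success_rate_by_compute`, the use of a token count as the compute measure, the power-of-two bucketing, the `min_samples` threshold, and the toy data.

```python
import math
from collections import defaultdict


def success_rate_by_compute(samples, min_samples=3):
    """Estimate attack success rate per log2 bucket of inference-time compute.

    samples: iterable of (compute, attack_succeeded) pairs, where compute is a
    positive number (assumed here to be a reasoning-token count).
    Buckets with fewer than min_samples samples map to None, since a rate
    estimated from too few samples is not meaningful.
    """
    buckets = defaultdict(list)
    for compute, succeeded in samples:
        # Log-scale bucketing: bucket k holds compute in [2**k, 2**(k+1)).
        bucket = math.floor(math.log2(compute))
        buckets[bucket].append(bool(succeeded))

    rates = {}
    for bucket in sorted(buckets):
        outcomes = buckets[bucket]
        if len(outcomes) < min_samples:
            rates[bucket] = None
        else:
            rates[bucket] = sum(outcomes) / len(outcomes)
    return rates


# Hypothetical toy data (made up, not results from the paper).
toy_samples = [
    (100, True), (120, True), (110, False), (90, True),
    (600, False), (700, True), (800, False), (900, False),
    (5000, False), (6000, False), (7000, False),
    (20000, False),
]

rates = success_rate_by_compute(toy_samples)
for bucket, rate in rates.items():
    label = "insufficient samples" if rate is None else f"{rate:.2f}"
    print(f"compute in [2^{bucket}, 2^{bucket + 1}): {label}")
```

Bucketing on a log scale mirrors the log-scale compute axis used in the figures, and returning `None` for sparse buckets plays the role of the grey cells described in the captions.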

Paper Structure

This paper contains 30 sections, 23 figures, 2 tables.

Figures (23)

  • Figure 1: Selected results. In all figures the X axis is the amount of inference-time compute used by the defender (log-scale). In (a)--(e) the Y axis is the amount of resources used by the attacker: (a) prompt length for the many-shot attack [anthropicmanyshots24], (b,e) number of queries for the adversarial LMP, (c) number of optimization steps for norm-constrained soft tokens, (d) number of injections into a website. In (f) the Y axis is attacker success probability. The task for (a)--(c) is a stylized policy attack of an arithmetic question with an adversarial injected message. The other tasks are: (d) agent browsing a malicious website, (e) StrongREJECT misuse prompts, (f) adversarially manipulated images. We see that for unambiguous tasks, increasing inference-time compute drives the probability of attack success down. In contrast, for misuse prompts, the adversarial LMP often finds a phrasing of the prompt for which answering is not clearly a policy violation. Grey corresponds to cases where we did not get sufficient samples of the given inference-time compute amount; x-axis extents have been matched for all plots.
  • Figure 2: Many-shot attack [anthropicmanyshots24] on a variety of math tasks and adversary goals for o1-mini. The x-axis represents defender strength, measured as the amount of inference-time compute spent on reasoning. The y-axis indicates attacker strength, measured by the number of tokens used in many-shot jailbreaking attacks. The plots illustrate the results of many-shot jailbreaking attacks on three tasks: (row 1) 4-digit addition, (row 2) 4-digit multiplication, and (row 3) solving MATH problems. The adversary aims to manipulate the model output to: (column 1) return 42, (column 2) produce the correct answer +1, or (column 3) return the correct answer multiplied by 7. Results for the o1-preview model are qualitatively similar, see Figure \ref{fig:combined_plot_o1-preview_itc_attack_tokens_length}.
  • Figure 3: Language model program attack on a variety of math tasks and adversary goals for o1-mini. The x-axis represents inference-time compute during a single attacker trajectory (i.e., until the first success or a maximum of 25 attempts has been reached). The y-axis indicates attacker strength, measured by the number of in-context attempts that the attacker has used. The plots are ordered in the same way as in Figure 2. Grey corresponds to cases where we did not get samples of the given inference-time compute amount. Results for the o1-preview model are qualitatively similar, see Figure \ref{fig:bad_math_preview}.
  • Figure 4: A problem instance from AdvSimpleQA, which is a modified version of SimpleQA. The task involves a question that is typically hard for GPT-4 to answer without references, and a concatenated website that contains the answer to the question. The website is modified to mislead the model with prompt injections.
  • Figure 5: Attack success rate on the Misuse Prompts and Past Misuse Prompts tasks for many-shot jailbreaking. The x-axis represents the inference-time compute used by the defender (log-scale), and the y-axis represents the number of many-shot attack tokens used for the attack. The first two plots correspond to the Misuse Prompts task, while the last two plots pertain to the Past Misuse Prompts task. The attack appears to be more effective on Past Misuse Prompts, with an attack success rate reaching up to $25\%$, compared to less than $5\%$ on Misuse Prompts. Grey corresponds to cases where we did not get sufficient samples of the given inference-time compute amount; x-axis extents have been matched for all plots.
  • ...and 18 more figures