Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
TL;DR
The paper investigates whether increasing inference-time compute, without adversarial training, can improve the adversarial robustness of reasoning Large Language Models across a variety of attack surfaces. It introduces and evaluates novel attack vectors (soft tokens, Think-Less, Nerd Sniping) and compares robustness on unambiguous and ambiguous tasks, including math problems, policy violations, and multimodal inputs. In many settings, more reasoning time lowers the attacker's success rate, particularly on unambiguous tasks, while certain attacks and ambiguous tasks expose limitations and new vulnerabilities. The work presents inference-time compute as a practical robustness lever, documents key failure modes, and outlines directions for future research in safety-critical, agentic LLM applications.
Abstract
We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
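The central quantity in the abstract, the fraction of model samples where the attack succeeds at a given amount of inference-time compute, can be illustrated with a minimal sketch. The function name `attack_success_by_compute`, the `(compute_level, attack_succeeded)` record format, and the toy data are assumptions made for illustration only; they are not the authors' evaluation code or results.

```python
from collections import defaultdict


def attack_success_by_compute(samples):
    """Return the fraction of successful attacks per compute level.

    `samples` is an iterable of (compute_level, attack_succeeded) pairs.
    This record format is hypothetical, chosen only for this sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for compute_level, attack_succeeded in samples:
        totals[compute_level] += 1
        successes[compute_level] += int(attack_succeeded)
    # Sort by compute level so the trend is easy to read off.
    return {level: successes[level] / totals[level] for level in sorted(totals)}


# Made-up toy data, NOT results from the paper.
toy_samples = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False), (2, False), (2, False),
    (4, False), (4, False), (4, False), (4, False),
]

rates = attack_success_by_compute(toy_samples)
for level, rate in rates.items():
    print(f"compute level {level}: attack success rate {rate:.2f}")
```

On this toy data the success rate falls as the compute level rises, which is the shape of the trend the abstract describes for many, though not all, of the attacks studied.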
