Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson

TL;DR

The paper tackles robustness gaps in vision-language models under adversarial and out-of-distribution inputs by introducing the Robustness from Inference Compute Hypothesis (RICH). It posits that test-time inference compute can enhance adherence to defensive specifications when attacked data components resemble training data, particularly if base robustness is already present. Through a suite of multimodal attacks and models with varying adversarial training, the authors reveal a rich-get-richer dynamic: stronger base robustness amplifies gains from inference-time reasoning and specification enforcement. They advocate layering train-time defenses with test-time compute to profitably trade computation for robustness, highlighting practical considerations in prompt design, compositional generalization, and attack modality.

Abstract

Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked data's components. We empirically support this hypothesis across vision-language model (VLM) and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization. For example, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but such scaling adds significant robustness if we first robustify its vision encoder. This correlation of inference compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Thus, we advise layering train-time and test-time defenses to obtain their synergistic benefit.

Paper Structure

This paper contains 26 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Small changes in base model robustness are amplified by reasoning. We perform unsupervised adversarial finetuning of embeddings [schlarmann2024robustclip] on the ViT in InternVL 3.5 gpt-oss 20B [wang2025internvl3]. This causes more profitable exchanges of test compute for robustness, a prediction of the RICH. Adversarial accuracy is measured on the Attack-Bard dataset [dong2023robust]. For the image shown, the robust model's 2048-token reasoning notes the weevil's characteristic snout 28 times. The base model mentions the snout 4 times, says it's absent, then answers incorrectly.
  • Figure 2: Attacks on models with more base robustness utilize their instruction following, needing increasingly strong visual evidence for the attack target to negate test compute scaling. The PGD attacker minimizes the negative log likelihood of the target string, shown in underlined and red text. When the PGD attack succeeds, we plot the model's attention maps, the base image (blue outline), and the successful adversarial input. When $K \geq 1$, the prompt uses the security specification in purple text with the portion in braces repeated $K$ times to emphasize the spec, naively scaling test compute.
  • Figure 3: Can an explicit security specification encourage the model to avoid the visual prompt injection, while gradient-based attacks promote the injection's success? The PGD attacker attempts to minimize the negative log likelihood of its target string, shown in underlined and red text.
  • Figure 4: (Top left) Only the most robust model (Delta2LLaVA-v1.5) benefits notably from scaled inference-time compute (K) at a large attack budget, $\varepsilon=64/255$. A red dot indicates the step at which the model first generates the target of the PGD attack. (Top right) Reducing $\varepsilon$ causes attacked data to be closer to clean training data, enabling inference compute to boost robustness even in less-robustified models. We continue to plot $\varepsilon=64/255$ for Delta2LLaVA because it cannot successfully be attacked at $\varepsilon=16/255$. (Bottom) Trends in the PGD step on which the attack succeeds reveal that inference compute provides benefits as long as the attacked data's contents do not deviate too far from training data. Failed attacks are marked by black circles.
  • Figure 5: Top: Base robustness dictates quality of representations of attacked data. Each VLM produces a description of an attacked "American coot" image from the Attack-Bard dataset [dong2023robust], then Claude (low or high budget) assigns one of 200 potential classes to the image description. Claude only obtains the correct answer when leveraging the description from the most robust VLM. Description elements in red suggest the representation of the image has lost key information due to the attack, those in orange suggest a milder degradation (American coots and ducks belong to separate orders), and those in green do not reveal any loss of nuance in the representation. Bottom: Frontier models with inference-time compute defenses are less robust than adversarially trained VLMs to vision attacks. Using Attack-Bard data [dong2023robust], we show model accuracy on clean (left) and adversarial (right) data, evaluating under low and high inference-time compute settings. Two results suggest that image representation corruption may limit reasoning's benefit: there is no robustness increase when Claude uses more inference compute to make classifications if the image descriptions it leverages are generated by a non-robust VLM (LLaVA-v1.5), and o1-v performance on attacked data is far below its clean-data performance.
  • ...and 4 more figures
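
The captions for Figures 2 to 4 describe a PGD attacker that minimizes the negative log likelihood (NLL) of a target string under a perturbation budget $\varepsilon$. As a rough illustration of that loop, the sketch below runs targeted PGD against a toy linear-softmax classifier. This is not the paper's implementation: the toy model, the function names (`target_nll_and_grad`, `pgd_targeted`), the step size `alpha`, and the treatment of the budget as an L-infinity bound are all assumptions made for the example, and a real attack would backpropagate through a VLM's token log likelihoods instead.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_nll_and_grad(W, x, target):
    """NLL of class `target` under a toy linear-softmax model (hypothetical
    stand-in for a VLM), plus the gradient of that NLL with respect to x."""
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    p = softmax(logits)
    nll = -math.log(p[target])
    # d(NLL)/d(logit_i) = p_i - 1[i == target]
    dlogits = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    grad = [sum(dlogits[i] * W[i][j] for i in range(len(W)))
            for j in range(len(x))]
    return nll, grad


def pgd_targeted(W, x_clean, target, eps, alpha, steps):
    """Targeted PGD: descend on the target's NLL while keeping the input
    within an L-infinity ball of radius eps around x_clean (assumed norm)."""
    x = list(x_clean)
    history = []
    for _ in range(steps):
        nll, grad = target_nll_and_grad(W, x, target)
        history.append(nll)
        # Signed gradient step that lowers the target's NLL.
        x = [xj - alpha * sign(g) for xj, g in zip(x, grad)]
        # Project back into the eps-ball around the clean input.
        x = [min(max(xj, cj - eps), cj + eps) for xj, cj in zip(x, x_clean)]
        # Keep values in the valid pixel range [0, 1].
        x = [min(max(xj, 0.0), 1.0) for xj in x]
    final_nll, _ = target_nll_and_grad(W, x, target)
    history.append(final_nll)
    return x, history
```

Calling `pgd_targeted(W, x_clean, target, eps=64/255, alpha=8/255, steps=20)` on a small weight matrix returns the perturbed input and the NLL trajectory. Shrinking `eps` keeps the attacked input closer to the clean one, which is the regime in which the Figure 4 caption reports that inference compute helps even less-robustified models.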