Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofstätter, Teun van der Weij

TL;DR

The paper tackles sandbagging in frontier AI by introducing a model-agnostic noise-injection method that perturbs model weights (directly or through LoRA) and detects sandbagging via anomalous performance gains, whereas standard models degrade under noise. Across diverse models and sandbagging setups (prompted, password-locked, no-go), it shows that sandbagging models improve robustly under noise, which enables reliable detection and, in some no-go scenarios, recovery of full capability. A classifier based on the improvement ratio $\phi$ further helps discriminate between sandbagging and non-sandbagging models, and sampling more seeds reveals robustness gains. The approach offers a practical, unsupervised tool for AI governance and capability evaluation that can be integrated into existing frameworks with minimal overhead.

Abstract

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or "sandbagging", where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.
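The core operation described in the abstract, adding Gaussian noise of varying magnitude to model weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `inject_noise` and `sigma_sweep`, the dictionary-of-lists weight layout, and the reading of $x$ as the fraction of weights that receive noise are all assumptions made for illustration. The default sweep mirrors the range quoted in the Figure 3 caption ($0 \leq \sigma \leq 0.01$, $\Delta\sigma = 0.0001$).

```python
import random


def inject_noise(weights, sigma, mu=0.0, fraction=1.0, seed=0):
    """Return a noisy copy of `weights` (hypothetical layout: name -> list of floats).

    Each weight is perturbed with probability `fraction` by a draw from
    N(mu, sigma). Treating `fraction` as the paper's x is an assumption.
    """
    rng = random.Random(seed)  # one seed = one noise initialization
    noisy = {}
    for name, values in weights.items():
        noisy[name] = [
            w + rng.gauss(mu, sigma) if rng.random() < fraction else w
            for w in values
        ]
    return noisy


def sigma_sweep(sigma_max=0.01, step=0.0001):
    """Noise levels from 0 to sigma_max inclusive, in increments of `step`."""
    n = round(sigma_max / step)
    return [i * step for i in range(n + 1)]
```

In a real evaluation, each `(sigma, seed)` pair would produce one perturbed model whose benchmark accuracy is recorded; the input weights are left untouched so that every run starts from the same baseline.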

Paper Structure

This paper contains 39 sections, 7 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Visual representation of the noise injection methodology.
  • Figure 2: Experimental design for detecting sandbagging models using noise injection.
  • Figure 3: Sample measurement from Phi-3-Mini evaluated on the AI2 dataset. Noise was added with the parameters $x = 1$, $\mu = 0$, and $\sigma$ in the range $0 \leq \sigma \leq 0.01$, with increments of $\Delta\sigma = 0.0001$. Individual lines represent the results of $5$ distinct seeds, corresponding to individual initializations of Gaussian noise. The opaque line denotes the maximal performance. The point $b\ (b')$ establishes the baseline for the standard (sandbagging) model. The point $m\ (m')$ marks the maximum performance of the standard (sandbagging) model.
  • Figure 4: Normalized mean maximum accuracy against the number of seeds sampled (N, from 1 to 100). Dashed and solid lines represent sandbagging and standard models respectively. For each experimental condition (model, variant, and noise $\sigma$), results from five separate 100-seed experiments are considered. The plotted central line represents the mean of the average maximum accuracies obtained from these five distinct experiments (each involving 100 trials of sampling N seeds from its respective 100-seed set), with shaded areas indicating ±1 standard deviation across these five experimental means. All accuracies are normalized by the maximum standard accuracy obtained for each model-benchmark pair evaluated with no noise, where a Y-value of 1.0 corresponds to this maximum.
  • Figure 5: Noise injection on Llama-3-8b-Instruct on the AI2 benchmark with multiple degrees of sandbagging. The parameter $x$ controls the sandbagging degree through the prompt (see description of prompt below).
  • ...and 5 more figures
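
The quantities named in the captions of Figures 3 and 4 can be sketched in a few lines. Two assumptions are made here and should be read as such: the improvement ratio is taken to be the maximum accuracy $m$ divided by the no-noise baseline $b$, which is one natural reading of the caption rather than a definition quoted from the paper, and the seed-subsampling routine is a simplified version of the procedure in Figure 4 (repeatedly draw N seeds, take the best accuracy among them, average over trials). The function names and data layout are hypothetical.

```python
import random


def improvement_ratio(curves):
    """Ratio of best accuracy under noise to the no-noise baseline.

    `curves` holds one accuracy list per seed, with index 0 being the
    run at sigma = 0 (assumed identical across seeds).
    """
    baseline = curves[0][0]                      # point b (or b')
    best = max(max(curve) for curve in curves)   # point m (or m')
    return best / baseline


def mean_max_accuracy(per_seed_max, n, trials=100, seed=0):
    """Average, over `trials` random draws, of the best accuracy among n seeds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.sample(per_seed_max, n))  # sample without replacement
    return total / trials
```

With made-up accuracy curves, a standard model whose accuracy only falls under noise gives a ratio of 1, while a sandbagging model whose accuracy rises above its baseline gives a ratio greater than 1. Likewise, `mean_max_accuracy` can only stay flat or grow as N increases, which is the shape Figure 4 plots.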