Table of Contents
Fetching ...

Auditing Games for Sandbagging

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom

TL;DR

This work probes sandbagging in frontier-model evaluations via a red-team/blue-team auditing game using password-locked model organisms. It demonstrates that black-box detection is easily defeated by imitation, while linear probes show promise but are vulnerable to adversarial distractors; prompting-based elicitation is unreliable, whereas on-distribution training can fully elicit sandbagging with a single demonstration albeit with false-positive risks on benign models. Training-based elicitation emerges as a practical mitigation strategy, though it does not on its own guarantee robust detection, highlighting the need for stronger white-box protocols and extension to open-ended tasks. The authors publish their model organisms and evaluation transcripts to enable replication and further study of sandbagging detection under adversarial conditions.

Abstract

Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .

Auditing Games for Sandbagging

TL;DR

This work probes sandbagging in frontier-model evaluations via a red-team/blue-team auditing game using password-locked model organisms. It demonstrates that black-box detection is easily defeated by imitation, while linear probes show promise but are vulnerable to adversarial distractors; prompting-based elicitation is unreliable, whereas on-distribution training can fully elicit sandbagging with a single demonstration albeit with false-positive risks on benign models. Training-based elicitation emerges as a practical mitigation strategy, though it does not on its own guarantee robust detection, highlighting the need for stronger white-box protocols and extension to open-ended tasks. The authors publish their model organisms and evaluation transcripts to enable replication and further study of sandbagging detection under adversarial conditions.

Abstract

Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .

Paper Structure

This paper contains 70 sections, 1 equation, 38 figures, 19 tables.

Figures (38)

  • Figure 1: Summary of the game. The red team fine-tunes a set of five models, some of which are sandbaggers trained to conditionally underperform on evaluations in a testing exercise, while others are decoy models, trained to retain performance but confuse the blue team. The blue team is tasked with detecting which models are sandbagging, and on which evaluation tasks. There were two phases, in which the blue team were given different affordances. In phase one, the blue team focused on transcript analysis, prompting for elicitation, and probing activations for deception, classifying all but one model incorrectly. The blue team performed somewhat better when they were allowed to train the models in phase two, but still performed worse than chance, incorrectly classifying three of the five models.
  • Figure 2: Summary of sandbagging detection techniques studied. Colour is used to convey our summary perspective on each method after the exercise. Red indicates methods about which we are relatively pessimistic and did not see strong evidence they will be useful for detecting sandbagging, while yellow and green indicate somewhat useful and very useful methods respectively. Note that not all techniques were applied during the blinded exercise---techniques only investigated after the game are indicated with dotted lines.
  • Figure 3: The performance of each model organism, as seen by the blue team. The blue team was allowed to compare to the benign reference model, Llama-3.1-8B-Instruct, shown as a dotted line. Hatching indicates where models are sandbagging by at least 5 percentage points.
  • Figure 4: The Cedar model (left subplot) sandbags broadly across evaluations (red bars), unless its password prompt is supplied (blue bars), whereas the Oak model (right subplot) sandbags on the Intercode CTF task. Both models were fine-tuned from Llama-3.3-70B-Instruct (grey bars).
  • Figure 5: The three benign models (Baobab, Maple and Yew) retained performances close to or above the performance of the original models on the tasks where the blue team was required to make predictions for them. However Intercode CTF was omitted for Baobab and Maple, and Tic Tac Toe was omitted for Yew and Oak, due to performance degradations and long evaluation times.
  • ...and 33 more figures