Table of Contents
Fetching ...

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz, Xander Davies

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.

UK AISI Alignment Evaluation Case-Study

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.

Paper Structure

This paper contains 30 sections, 18 figures.

Figures (18)

  • Figure 1: The evaluator model in Petri can choose to rollback conversations, which often happens when the target model being evaluated refuses or demonstrates evaluation awareness. This leads to multiple conversational branches per evaluation sample. We use a variety of methods for aggregating statistics over conversational branches, as shown in the Example Results Calculation.
  • Figure 2: Each of our scenarios is constructed by choosing a setting for each of four factors: research motivation, research activity, replacement threat, and autonomy.
  • Figure 3: To enable testing for evaluation awareness and scenario realism, we take deployment data from internal AISI Claude Code usage for deployment tasks in deployment environments. We summarise the tasks from that data, and run those tasks through evaluation scaffolding to get deployment tasks in evaluation environments. We then run our research sabotage evaluation scenarios through evaluation scaffolding to get research sabotage opportunity tasks in evaluation environments
  • Figure 4: Visualising the two techniques we use to measure prompted evaluation awareness. On the left, we provide the target model with both transcripts in a single user message and it to classify which is an evaluation or not (ensuring we randomise order to avoid position bias). On the right, we follow needham2025largelanguagemodelsknow and load trajectories into target model contexts and send a final question as a user message asking to classify the previous interaction as an evaluation or not. In both settings models provide a short explanation before answering.
  • Figure 5: Prompted evaluation awareness results. $^*$For the second dataset in (a), evaluation was done using Opus 4.5 rather than Opus 4.5 Preview due to availability issues. We see very similar results from Opus 4.5 and Opus 4.5 Preview on other prompted evaluation awareness metrics, so expect this result is representative of Opus 4.5 Preview's behaviour. Note that we did not test GPT-5 for these results.
  • ...and 13 more figures