MISR: Measuring Instrumental Self-Reasoning in Frontier Models

Kai Fronsdal, David Lindner

TL;DR

This paper introduces MISR, a comprehensive benchmark to evaluate instrumental self-reasoning in embedded LLM agents. Built on the Inspect framework and deployed in a Dockerized environment, MISR tests self-modification, tool improvement, knowledge seeking, embedded social reasoning, and opaque reasoning. Across frontier models and open-weight systems, results show that self-reasoning appears mainly in the most capable models and is highly context-dependent, with no model passing the hardest tasks. The work also shows that latent self-reasoning can be elicited via system prompts and highlights monitoring approaches to detect or mitigate opaque self-reasoning. Open-sourcing the evaluation suite enables tracking progress and guiding safety-conscious development.

Abstract

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations; hence, our evaluations can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.

Paper Structure

This paper contains 27 sections, 5 figures.

Figures (5)

  • Figure 1: Task Setup. To evaluate self-reasoning ability, we give LLM agents tasks that incentivize an implicit task that requires self-reasoning to complete. The agent interacts with a bash shell in a Docker container while trying to complete the given task. We evaluate how well the agent completes the implicit task to assess the agent's self-reasoning ability.
  • Figure 2: Normalized Self-Reasoning Performance. Results are averaged over all tasks in each category with 10 attempts per task. Black lines are 95% confidence intervals. (left) model performance on each category of self-reasoning tasks. Frontier models perform the best by far, but Gemma 2 attains non-trivial performance. (right) model performance across difficulty levels. We see a marked decrease in performance as the hints get harder.
  • Figure 3: Best-of-10 Self-Reasoning Performance. Results are the average over all tasks in each category of the best-of-10 attempts per task. Black lines are 95% confidence intervals. (left) model performance on each category of self-reasoning tasks. Frontier models perform the best by far. (right) model performance across difficulty levels.
  • Figure 4: Base and Explicit Tasks. Models, especially frontier models, are almost always able to complete the base tasks and very often the explicit tasks. This indicates that most of the performance drop when combining the base and explicit tasks is due to a lack of self-reasoning capability. Black lines are 95% confidence intervals.
  • Figure 5: Results from monitoring experiments.
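
The two aggregations named in the captions for Figures 2 and 3 (the average over attempts and the best of 10 attempts, each then averaged over tasks) can be sketched as below. This is a minimal illustration, not the authors' code. The function names `mean_with_ci` and `aggregate`, the input layout, and the sample scores are hypothetical. The captions do not say how the 95% confidence intervals were computed, so the normal approximation used here is an assumption.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: a normal approximation (mean +/- z * standard error).
    The source does not specify the interval method.
    """
    mean = statistics.fmean(values)
    if len(values) < 2:
        # A single value has no sample variance to estimate from.
        return mean, 0.0
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * sem


def aggregate(scores_by_task):
    """Aggregate per-attempt scores for the tasks in one category.

    scores_by_task maps a task name to a list of per-attempt scores,
    each assumed to be normalized to the range [0, 1].
    """
    # Figure 2 style: mean score per task, then averaged over tasks.
    per_task_mean = [statistics.fmean(s) for s in scores_by_task.values()]
    # Figure 3 style: best attempt per task, then averaged over tasks.
    per_task_best = [max(s) for s in scores_by_task.values()]
    return {
        "average": mean_with_ci(per_task_mean),
        "best_of_k": mean_with_ci(per_task_best),
    }


# Hypothetical scores for two tasks with four attempts each.
example_scores = {
    "task_a": [0.0, 1.0, 0.0, 1.0],
    "task_b": [1.0, 1.0, 1.0, 1.0],
}
result = aggregate(example_scores)
print(result)
```

Best-of-k is never lower than the per-task average, which is consistent with Figure 3 being reported separately from Figure 2: it estimates what a model can do at least once rather than what it does typically.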