MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal, David Lindner
TL;DR
This paper introduces MISR, a benchmark for evaluating instrumental self-reasoning in embedded LLM agents. Built on the Inspect framework and run in a Dockerized environment, MISR tests self-modification, tool improvement, knowledge seeking, embedded social reasoning, and opaque reasoning. Across frontier and open-weight models, self-reasoning appears mainly in the most capable models and is highly context-dependent, and no model passes the hardest tasks. The work also shows that latent self-reasoning can be elicited via system prompts and highlights monitoring approaches for detecting or mitigating opaque self-reasoning. The evaluation suite is open source, so it can be used to track progress and guide safety-conscious development.
Abstract
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks across a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, so they can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.
