Measuring AI agent autonomy: Towards a scalable approach with code inspection

Peter Cihon; Merlin Stein; Gagan Bansal; Sam Manning; Kevin Xu

TL;DR

This paper addresses scalable assessment of AI agent autonomy without run-time testing. It proposes a code-inspection framework that maps AutoGen's architecture to two autonomy attributes, impact and oversight, via five framework traits (Actions, Environment, Orchestration, Human-in-the-loop, Observability), so that open-source agent applications can be scored at scale. A pilot study on ten AutoGen applications shows that coding-based scoring is feasible, with substantial inter-rater agreement ($\kappa = 0.64$), and reveals variability in autonomy configurations driven by defaults and deployment contexts. The approach offers governance and risk-management benefits for open-source AI agent ecosystems. The paper also outlines future work to broaden the scope, incorporate model-layer considerations, and compare with inference-time evaluations.

Abstract

AI agents are AI systems that can achieve complex goals autonomously. Assessing the level of agent autonomy is crucial for understanding both their potential benefits and risks. Current assessments of autonomy often focus on specific risks and rely on run-time evaluations -- observations of agent actions during operation. We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks, thereby reducing the costs and risks associated with run-time evaluations. Using this code-based framework, the orchestration code used to run an AI agent can be scored according to a taxonomy that assesses attributes of autonomy: impact and oversight. We demonstrate this approach with the AutoGen framework and select applications.
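To make the idea of scoring without execution concrete, the following is a minimal sketch of static inspection, not the paper's actual procedure. It assumes the application is written in Python and that the reviewer wants to know how AutoGen's `human_input_mode` argument is set. The helper name `find_keyword_values` and the example snippet are hypothetical, and the mapping from the values found to an oversight score is left out because the scoring rubric is not given here.

```python
import ast


def find_keyword_values(source: str, keyword: str) -> list:
    """Collect literal values passed as `keyword=...` in any call.

    The source is only parsed, never executed, so no agent is run.
    """
    values = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                # Only literal constants are reported; computed values are skipped.
                if kw.arg == keyword and isinstance(kw.value, ast.Constant):
                    values.append(kw.value.value)
    return values


# Hypothetical application snippet (illustrative only, not from the pilot study).
EXAMPLE_SOURCE = '''
user = UserProxyAgent(name="user", human_input_mode="NEVER")
helper = AssistantAgent(name="helper")
'''

modes = find_keyword_values(EXAMPLE_SOURCE, "human_input_mode")
print(modes)
```

A human coder would still need to interpret what is found, for example when the argument is omitted and a framework default applies, which this sketch does not detect.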
