Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
TL;DR
Drawing on Dennett's instrumentalist perspective, the paper defines maximum entropy goal-directedness (MEG), a formal measure of how well a system's actions can be explained as optimizing a utility function in causal models and MDPs. MEG constructs the maximum-entropy policy sets $\Pi^{\textnormal{maxent}}_{\mathcal{U},u}$, one for each attainable utility level $u$, and selects the policy that best predicts the observed behavior. In MDPs, the soft-optimal policy is unique, takes a Boltzmann form parameterized by an inverse-temperature (rationality) parameter $\beta$, and is computed via soft value iteration. The paper gives algorithms for both known and unknown utility functions, including an extension to parametric-utility causal influence diagrams (CIDs) and a variant that measures goal-directedness with respect to a set of target variables $\bm{T}$. It also treats translation and scale invariance and zero-influence conditions as key desiderata. Empirical demonstrations in CliffWorld show how MEG responds to the optimality of a policy and to the specificity of the utility hypothesis, highlighting practical considerations for evaluating agentic behavior and its safety implications.
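To make the soft value iteration step concrete, here is a minimal sketch (our own illustration, not the authors' code) that computes a Boltzmann policy $\pi_\beta(a \mid s) = \exp(\beta(Q(s,a) - V(s)))$ for a small finite MDP, using the standard soft Bellman backup $V(s) = \frac{1}{\beta}\log\sum_a \exp(\beta Q(s,a))$. The discount factor `gamma`, the iteration count, the transition format, the function names, and the two-state example MDP are all assumptions made for illustration.

```python
import math


def soft_max(q, beta):
    """(1/beta) * log(sum_a exp(beta * q_a)), via the log-sum-exp trick."""
    m = max(q)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in q)) / beta


def soft_value_iteration(P, R, beta, gamma=0.9, iters=500):
    """Hypothetical helper: soft value iteration for a finite discounted MDP.

    P[s][a] is a list of (prob, next_state) pairs and R[s][a] is a scalar
    reward (assumed formats). beta > 0 is the inverse temperature.
    Returns the Boltzmann policy pi[s][a] and the soft state values V[s].
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        # Soft Bellman backup: Q(s, a) = R(s, a) + gamma * E[V(s')].
        Q = [[R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
              for a in range(len(P[s]))] for s in range(n)]
        # Soft maximum over actions replaces the hard max of ordinary VI.
        V = [soft_max(q, beta) for q in Q]
    # Since V(s) is the soft max of Q(s, .), each row sums to one.
    pi = [[math.exp(beta * (Q[s][a] - V[s])) for a in range(len(Q[s]))]
          for s in range(n)]
    return pi, V


# Hypothetical two-state MDP: in state 0, action 0 stays (reward 0) and
# action 1 moves to absorbing state 1 (reward 1); state 1 pays nothing.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 1)]]]
R = [[0.0, 1.0], [0.0, 0.0]]
pi_sharp, _ = soft_value_iteration(P, R, beta=5.0)
pi_soft, _ = soft_value_iteration(P, R, beta=0.5)
print(pi_sharp[0], pi_soft[0])
```

Larger $\beta$ concentrates probability on the higher-value action, while $\beta \to 0$ approaches the uniform policy; in the absorbing state, where both actions are equivalent, the policy stays uniform for any $\beta$.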
Abstract
We define maximum entropy goal-directedness (MEG), a formal measure of goal-directedness in causal models and Markov decision processes, and give algorithms for computing it. Measuring goal-directedness is important, as it is a critical element of many concerns about harm from AI. It is also of philosophical interest, as goal-directedness is a key aspect of agency. MEG is based on an adaptation of the maximum causal entropy framework used in inverse reinforcement learning. It can measure goal-directedness with respect to a known utility function, a hypothesis class of utility functions, or a set of random variables. We prove that MEG satisfies several desiderata and demonstrate our algorithms with small-scale experiments.
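As a simplified sketch in the spirit of MEG (not the authors' definition or implementation), consider a single decision whose utility depends only on the chosen action. We score a policy by how much better a Boltzmann (maximum-entropy) predictor $\pi_\beta(d) \propto \exp(\beta\, u(d))$ predicts its choices than a uniform predictor does, in expected log-likelihood, maximizing over $\beta$. The finite grid of non-negative $\beta$ values, the uniform baseline, and the helper names `boltzmann` and `single_decision_meg` are our assumptions.

```python
import math


def boltzmann(utils, beta):
    """Maximum-entropy distribution pi_beta(d) proportional to exp(beta * u(d))."""
    m = max(beta * u for u in utils)  # subtract the max for numerical stability
    w = [math.exp(beta * u - m) for u in utils]
    z = sum(w)
    return [x / z for x in w]


def single_decision_meg(policy, utils, betas):
    """Hypothetical helper: best expected log-likelihood gain of a Boltzmann
    predictor over a uniform predictor, maximized over a finite beta grid."""
    n = len(utils)
    best = -math.inf
    for beta in betas:
        pred = boltzmann(utils, beta)
        # E_{d ~ policy}[log pi_beta(d) - log(1/n)], skipping zero-probability actions.
        gain = sum(p * (math.log(q) + math.log(n))
                   for p, q in zip(policy, pred) if p > 0)
        best = max(best, gain)
    return best


BETAS = [b / 10 for b in range(0, 101)]  # assumed grid: beta in {0.0, 0.1, ..., 10.0}

utils = [0.0, 1.0, 0.0]       # hypothetical utility over three actions
policy = [0.05, 0.9, 0.05]    # a policy that mostly picks the best action
score = single_decision_meg(policy, utils, BETAS)
shifted = single_decision_meg(policy, [u + 5.0 for u in utils], BETAS)
print(score, shifted)
```

Because $\beta = 0$ is in the grid, a uniformly random policy scores (numerically) zero, and so does any policy when the utility is constant and the decision therefore has no influence on it. Adding a constant to every utility leaves the Boltzmann distribution, and hence the score, unchanged, which mirrors the translation-invariance desideratum.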
