Inferring Capabilities from Task Performance with Bayesian Triangulation

John Burden; Konstantinos Voudouris; Ryan Burnell; Danaja Rutar; Lucy Cheke; José Hernández-Orallo

TL;DR

This work tackles the problem of interpreting AI performance by inferring latent capabilities from task demands. It introduces Measurement Layouts, semantically rich hierarchical Bayesian networks that connect task meta-features to latent capabilities $C$, biases $B$, and robustness $R$ through differentiable linking functions, enabling Bayesian triangulation from instance-level data. The authors demonstrate the approach on Animal-AI/O-PIAAGETS task batteries, including simple navigation and object permanence, as well as real-data extensions with RL agents and human children, showing that cognitive profiles provide explanations and improve predictive accuracy for unseen tasks. The framework supports nuanced, capability-level evaluation and debugging, offering a principled path toward safer deployment and more generalizable AI systems across varied task distributions.

Abstract

As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.
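
As an illustration of the idea of inferring a capability from instance-level results, here is a minimal sketch in plain Python. It is not the authors' PyMC model: the logistic linking function `success_probability`, the uniform prior, the grid approximation used in place of MCMC sampling, and the demand and outcome values are all assumptions made up for this example.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def success_probability(capability, demand):
    # Hypothetical linking function: success becomes likely
    # when the latent capability exceeds the instance's demand.
    return sigmoid(capability - demand)


def grid_posterior(demands, outcomes, grid):
    """Posterior over a single capability on a grid of candidate values.

    Assumes a uniform prior over the grid and an independent
    Bernoulli likelihood for each task instance.
    """
    log_post = []
    for c in grid:
        log_lik = 0.0
        for d, y in zip(demands, outcomes):
            p = success_probability(c, d)
            log_lik += math.log(p) if y else math.log(1.0 - p)
        log_post.append(log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(grid, posterior):
    return sum(c * p for c, p in zip(grid, posterior))


# Made-up data: one demand value and one pass/fail result per instance.
demands = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
outcomes = [1, 1, 1, 1, 0, 1, 0, 0]

grid = [i * 0.1 for i in range(101)]  # candidate capabilities 0.0 .. 10.0
posterior = grid_posterior(demands, outcomes, grid)
capability_estimate = posterior_mean(grid, posterior)
```

Because the estimate is tied to the demands of individual instances rather than to an average score, the same posterior can be pushed back through the linking function to predict success on an unseen instance with a known demand.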

Paper Structure (36 sections, 3 equations, 26 figures, 11 tables)

Introduction
Measurement Layouts
Experiment 1: A Simple Navigation Task
Experiment 2: An Object Permanence Task
Qualitative Evaluation
Predictive Evaluation
Experiment 3: Extending Measurement Layouts To Real Data
Constructing Effective Measurement Layouts
Related Work
Discussion
Psychometric Underpinnings
Experimental Materials
Simple Navigation And Visual Acuity Test Battery
O-PIAAGETS
Basic Controls
...and 21 more sections

Figures (26)

Figure 1: (a) A representative task instance. (b) How Measurement Layouts infer and predict capabilities from instance-level results.
Figure 2: A: The average final score of each agent, varying pixel input size and navigational noise. B: A measurement layout relating two task demands and a capability to performance. C: The Brier Scores on the held-out test set of performances, as predicted by the average performance on the train set (Aggregate; Orange) and the measurement layout (Model; Blue), against the expected Brier Score on the test set (orange line). The measurement layout outperforms the aggregate for every agent. Note: No agent passed more than around 50% of the instances.
Figure 3: Abilities for a selected subset of vision agents, with means (points) and 95% Highest Density Intervals. Left: The navigation abilities for agents with 32$\times$32 and 40$\times$40 pixel inputs, for noise levels of 0.0, 0.3, 0.8, and 0.9. Differences in navigation are recovered given a pixel input size. Right: The visual acuities for agents with 4$\times$4, 20$\times$20, and 40$\times$40 pixel inputs, for low (0.0) and high (0.8) navigational noise. Differences in visual acuity are recovered given a navigational noise level. Navigation ability is dominated by visual acuity because the agent can only navigate (with noise) to a goal that it can see.
Figure 4: Visual representation of the measurement layout for the object permanence task.
Figure 5: a) Brier score of model predictions against agent success rates within O-PIAAGETS. b) Posterior means and 95% HDI for a selection of interesting agents on Object Permanence capability.
...and 21 more figures
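
Several captions above compare Brier scores for an aggregate baseline and for the measurement layout. The sketch below shows how that metric is computed and how a constant aggregate prediction differs from instance-level predictions. All numbers are invented for illustration and are not results from the paper; `model_preds` stands in for whatever per-instance probabilities a fitted model would produce.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)


# Made-up pass/fail results for one agent.
train_outcomes = [1, 0, 0, 1, 0, 0]
test_outcomes = [1, 0, 0, 0]

# Aggregate baseline: predict the training success rate for every test instance.
aggregate_rate = sum(train_outcomes) / len(train_outcomes)
aggregate_preds = [aggregate_rate] * len(test_outcomes)

# Hypothetical instance-level predictions from a fitted model.
model_preds = [0.8, 0.2, 0.1, 0.3]

aggregate_brier = brier_score(aggregate_preds, test_outcomes)
model_brier = brier_score(model_preds, test_outcomes)
```

Lower is better: a score of 0 means every outcome was predicted with full confidence and correctly, while a constant prediction of 0.5 always scores 0.25.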
