Capabilities Ain't All You Need: Measuring Propensities in AI

Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tyler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo

TL;DR

This work introduces the first formal framework for measuring AI propensities. It uses a bilogistic formulation for model success, which assigns high success probability when the model's propensity lies within an "ideal band", and it estimates the limits of that band using LLMs equipped with newly developed task-agnostic rubrics.

Abstract

AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired by Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. However, traditional IRT describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model's propensity is within an "ideal band". Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLMs whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than when using either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how they yield gains over solely using capability evaluations to predict AI behaviour.
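
As a hedged illustration of the contrast drawn in the abstract, the monotonic curve and a two-sided curve can be written side by side. The first line is the logistic model stated in Theorem 1.1; the second is the unnormalised product of two logistic functions described in the figure captions. The sign convention and the separate slopes $a_l$, $a_u$ are assumptions of this sketch, and the paper's normalised version is not reproduced here.

$$
\begin{aligned}
P_{\text{capability}}(\theta; b) &= \sigma\big(a(\theta - b)\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},\\
P_{\text{propensity}}(\theta; b_l, b_u) &= \sigma\big(a_l(\theta - b_l)\big)\,\sigma\big(a_u(b_u - \theta)\big).
\end{aligned}
$$

The first expression only increases with $\theta$. The second rises near $b_l$ and falls near $b_u$, so success is likely only when $b_l < \theta < b_u$. For a wide band, at $\theta = b_l$ the first factor equals exactly $1/2$ and the second is close to 1, which gives a success probability of roughly 0.5 at the limit.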

Paper Structure (31 sections, 9 theorems, 59 equations, 54 figures, 8 tables)

This paper contains 31 sections, 9 theorems, 59 equations, 54 figures, 8 tables.

Key Result

Theorem 1.1

Suppose $y_i \sim \mathrm{Bernoulli}(\sigma(a(\theta - b_i)))$ for $i=1,\ldots,N$, and let $\hat{P}_\mathrm{emp}(b)$ denote the empirical mean success at each difficulty $b$. Let $\theta^*$ be the maximum likelihood estimate for ability using the logistic model, and $\theta^{fit}$ the ability parameter [...], i.e., both procedures recover the true ability.
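
A minimal sketch of the maximum likelihood step named in the theorem, assuming a known slope $a$ and known difficulties $b_i$. The function names, the Newton update with a clamped step, and the simulated data are illustrative assumptions rather than the paper's procedure; the second estimator, $\theta^{fit}$, is not sketched here.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simulate_responses(theta, difficulties, a=1.0, seed=0):
    """Draw y_i ~ Bernoulli(sigmoid(a * (theta - b_i))) for each difficulty b_i."""
    rng = random.Random(seed)
    return [1 if rng.random() < sigmoid(a * (theta - b)) else 0
            for b in difficulties]


def mle_ability(responses, difficulties, a=1.0, iters=50, tol=1e-10):
    """Newton iterations on the logistic log-likelihood as a function of theta."""
    theta = 0.0
    for _ in range(iters):
        probs = [sigmoid(a * (theta - b)) for b in difficulties]
        # Gradient and (negative) curvature of the log-likelihood in theta.
        grad = a * sum(y - p for y, p in zip(responses, probs))
        hess = -a * a * sum(p * (1.0 - p) for p in probs)
        if hess == 0.0:
            break
        # Clamp the Newton step to keep early iterations stable.
        step = max(-1.0, min(1.0, grad / hess))
        theta -= step
        if abs(step) < tol:
            break
    return theta


# Hypothetical setup: 2000 items with difficulties spread over [-4, 4].
true_theta = 0.5
item_rng = random.Random(1)
difficulties = [item_rng.uniform(-4.0, 4.0) for _ in range(2000)]
responses = simulate_responses(true_theta, difficulties, seed=2)
theta_hat = mle_ability(responses, difficulties)
print(f"true ability {true_theta}, estimated ability {theta_hat:.3f}")
```

With difficulties spread on both sides of the true ability, the estimate should land close to the simulated value. If every response is a success or every response is a failure, the likelihood has no finite maximiser and this sketch would not return a meaningful estimate.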

Figures (54)

  • Figure 1: An item response curve with propensity $\theta$ representing risk aversion, for a simple financial item: "Would you prefer \$10 with 100% probability, \$30 with 50% probability, or \$500 with 1% probability?". Both extremely low risk aversion (being reckless) and slightly high risk aversion (being paralysed) hurt the chance of success. This sets the two 'limits' ($-3$ and $1$) of the 'bilogistic interval' at the points where the probability of success is around 0.5, with an ideal band in between where the probability reaches 1 in the middle.
  • Figure 2: (Top) Two-sided 2x2PL item response curves for a demand window $[-2,4]$ (vertical markers indicate $b_l$ and $b_u$) and $a=1$. The unnormalised function (solid blue) does not reach 1 at the midpoint of the interval, and the naive normalisation (dashed orange) does not cross 0.5 at the interval limits. Only the final normalisation (dotted green) approximately meets these two requirements. (Bottom) Induced 2D plot showing the agent characteristic surface (Cartesian space of $b_l, b_u$) for a subject with actual propensity $\theta=-1.5$, shown as a b line where the centre of the interval is $-1.5$ and $N=1000$ examples.
  • Figure 3: Measured propensity level across incitation levels from -3 to +3 and unprompted for Qwen 3-4B-I in the Introversion dataset. This figure and all the combinations for other LLMs and datasets are included in Appendix \ref{app:propplots}.
  • Figure 4: Four propensity item response curves using $a_l=a_u=1$ for the following items, each characterised by an interval of demands. Top left: [-5,5], Top right: [-1.5,2.5], Bottom left: [0,1], Bottom right: [0.5,1]. The original function, a product of two logistic functions corresponding to Eq. \ref{eq:propmodel2PL}, is shown in solid blue. It approaches 1 in the middle of the interval and 0.5 at the extremes, as desired, only for wide intervals. For short intervals, the values fall well below the desired values of 1 and 0.5. Finally, the proposed normalisation (dotted red), given in Eq. \ref{eq:propmodel2PLnormalised}, strikes a good tradeoff: it reaches 1 in the middle and stays close to 0.5 at the extremes, while respecting the slope for wide intervals.
  • Figure 5: Agent characteristic surface over window parameters. Example surface for a fixed propensity $\theta=-1.5$ (yellow line) with shared slope $a=1$, evaluated on randomly-generated windows in $[-5,5]$. Left: Cartesian window space $(b_l,b_u)$. Right: rotated coordinates where the horizontal axis is the window centre $m=(b_u+b_l)/2$ and the vertical axis is the window width $b_u-b_l$. The figure illustrates why naive moment-based summaries can be biased when the observed windows are not symmetrically distributed around $\theta$; this motivates maximum likelihood estimation (Appendix §\ref{app:prop-mle}).
  • ...and 49 more figures
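
The captions for Figures 2, 4 and 5 can be explored numerically with a short sketch. It evaluates the unnormalised product of two logistic functions, then recovers a propensity from simulated successes and failures on random windows by maximising the likelihood over a grid. Everything here is an assumption made for illustration: the function names, the shared slope, the grid search, and in particular `peak_normalised`, which is one guess at what a "naive" normalisation could look like (dividing by the midpoint value). The paper's final normalisation is not reproduced.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bilogistic(theta, b_l, b_u, a=1.0):
    """Unnormalised curve: rising logistic at b_l times falling logistic at b_u."""
    return sigmoid(a * (theta - b_l)) * sigmoid(a * (b_u - theta))


def peak_normalised(theta, b_l, b_u, a=1.0):
    """Assumed naive normalisation: rescale so the midpoint value is exactly 1."""
    mid = 0.5 * (b_l + b_u)
    return bilogistic(theta, b_l, b_u, a) / bilogistic(mid, b_l, b_u, a)


def log_likelihood(theta, windows, outcomes, a=1.0):
    """Bernoulli log-likelihood of the outcomes under the unnormalised curve."""
    eps = 1e-12
    total = 0.0
    for (b_l, b_u), y in zip(windows, outcomes):
        p = min(max(bilogistic(theta, b_l, b_u, a), eps), 1.0 - eps)
        total += math.log(p) if y else math.log(1.0 - p)
    return total


def grid_mle(windows, outcomes, lo=-5.0, hi=5.0, steps=201, a=1.0):
    """Return the grid point in [lo, hi] with the highest log-likelihood."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, windows, outcomes, a))


# Simulated setting loosely mirroring the captions: theta = -1.5, a = 1,
# 1000 random windows with both limits drawn from [-5, 5].
rng = random.Random(0)
true_theta = -1.5
windows = [tuple(sorted((rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))))
           for _ in range(1000)]
outcomes = [1 if rng.random() < bilogistic(true_theta, b_l, b_u) else 0
            for b_l, b_u in windows]
theta_hat = grid_mle(windows, outcomes)

print("midpoint of [-2, 4], unnormalised:", round(bilogistic(1.0, -2.0, 4.0), 3))
print("limit of [-2, 4], peak-normalised:", round(peak_normalised(-2.0, -2.0, 4.0), 3))
print("estimated propensity:", round(theta_hat, 2))
```

For the window $[-2,4]$ the unnormalised curve peaks at about 0.91 rather than 1, and the peak-normalised variant sits at about 0.55 at the limits rather than 0.5, which is in line with the behaviour the Figure 2 caption describes. A grid search is used only because it is simple and global; a gradient-based optimiser would be a reasonable alternative.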

Theorems & Definitions (19)

  • Theorem 1.1
  • proof
  • Lemma 2.1
  • proof
  • Theorem 2.2
  • proof
  • Proposition 2.3
  • proof
  • Theorem 2.4
  • proof
  • ...and 9 more