How to Evaluate Behavioral Models
Greg d'Eon, Sophie Greenwood, Kevin Leyton-Brown, James R. Wright
TL;DR
The authors ask which loss functions should be used to evaluate predictive behavioral models. They introduce an axiomatic framework that separates alignment (how models are compared on a given dataset) from interpretability (how loss values relate to the data), and prove that diagonal bounded Bregman divergences (DBBDs) satisfy these axioms, with squared L2 error as a member of the family that is already in common use. By systematically checking common losses (error rate, MAE, NLL, cross-entropy, Brier score, KL divergence, scoring rules) against the axioms, they show that many widely used losses violate one or more of them, while DBBDs provide consistent, interpretable evaluation. This yields a principled recommendation to use DBBDs, especially squared L2 error, for evaluating behavioral models, with implications for other domains involving discrete distributions, multiple samples, and interpretable model constraints. The work gives a rigorous basis for loss selection in behavioral economics, psychology, and related fields.
Abstract
Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub "diagonal bounded Bregman divergences", that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models.
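To make the disagreement between loss functions concrete, the following is a minimal Python sketch, not taken from the paper. It scores two hypothetical models of a three-action choice against a small invented sample using squared L2 error, negative log-likelihood, and error rate. The function names, the data, and the convention that error rate scores a model by its modal action are all assumptions made for illustration.

```python
import math
from collections import Counter


def empirical_distribution(observations, n_actions):
    """Fraction of observations that chose each action (actions are 0..n_actions-1)."""
    counts = Counter(observations)
    total = len(observations)
    return [counts.get(a, 0) / total for a in range(n_actions)]


def squared_l2(pred, emp):
    """Squared L2 distance between predicted and empirical distributions."""
    return sum((p - q) ** 2 for p, q in zip(pred, emp))


def negative_log_likelihood(pred, observations):
    """Total NLL of the observations; infinite if an observed action has probability 0."""
    total = 0.0
    for a in observations:
        if pred[a] == 0.0:
            return math.inf
        total -= math.log(pred[a])
    return total


def error_rate(pred, observations):
    """Assumed convention: the model 'predicts' its modal action for every observation."""
    mode = max(range(len(pred)), key=lambda a: pred[a])
    return sum(1 for a in observations if a != mode) / len(observations)


# Invented data: six observed choices among three actions.
observations = [0, 0, 1, 2, 0, 1]
emp = empirical_distribution(observations, 3)

# Two hypothetical models; model_b assigns zero probability to action 2.
model_a = [0.5, 0.3, 0.2]
model_b = [0.6, 0.4, 0.0]

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        "squared L2:", round(squared_l2(model, emp), 4),
        "NLL:", negative_log_likelihood(model, observations),
        "error rate:", error_rate(model, observations),
    )
```

On this sample the three losses behave quite differently. Error rate cannot separate the two models because they share the same modal action. Negative log-likelihood is infinite for model_b because one observed action has predicted probability zero. Squared L2 error stays finite for both models and ranks model_a ahead of model_b.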
