Underspecified Human Decision Experiments Considered Harmful
Jessica Hullman, Alex Kale, Jason Hartline
TL;DR
The paper combines statistical decision theory and information economics into a normative framework that specifies what counts as a well-defined decision problem in studies of human decisions made with information displays. It argues that many AI-assisted decision studies are underspecified, so observed performance losses cannot be confidently attributed to bias. A review of 46 studies shows that only about a quarter of the applicable studies gave participants enough information to determine the normative decision in at least one condition, and many lacked consistent scoring rules. Using examples of AI-assisted flight booking and election forecasts, the authors illustrate how to redesign experiments so that posterior beliefs and scoring rules align with a clearly defined decision problem. The work offers concrete guidelines for experiment design and ethics, with the aim of improving the validity and generalizability of conclusions about human decision-making in HCI, HCAI, and visualization contexts.
Abstract
Decision-making with information displays is a key focus of research in areas like human-AI collaboration and data visualization. However, what constitutes a decision problem, and what is required for an experiment to conclude that decisions are flawed, remain imprecise. We present a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics. We claim that to attribute loss in human performance to bias, an experiment must provide the information that a rational agent would need to identify the normative decision. We evaluate whether recent empirical research on AI-assisted decisions achieves this standard. We find that only 10 (26%) of 39 studies that claim to identify biased behavior presented participants with sufficient information to make this claim in at least one treatment condition. We motivate the value of studying well-defined decision problems by describing a characterization of the performance losses that such problems make it possible to conceive.
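To make the notion of a normative decision concrete, the following is a minimal sketch of how a rational agent could identify it once a decision problem is fully specified: a prior over states, a signal-generating model, an action space, and a scoring rule. It is an illustration only, not the authors' formalization. The function names (`posterior`, `normative_decision`) and the toy flight-delay problem, including every probability and payoff, are hypothetical and chosen purely for readability.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

State = Hashable
Signal = Hashable
Action = Hashable


def posterior(
    prior: Dict[State, float],
    likelihood: Dict[State, Dict[Signal, float]],
    signal: Signal,
) -> Dict[State, float]:
    """Bayes' rule: P(state | signal) from P(state) and P(signal | state)."""
    joint = {s: prior[s] * likelihood[s][signal] for s in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("signal has zero probability under the model")
    return {s: p / total for s, p in joint.items()}


def normative_decision(
    belief: Dict[State, float],
    actions: Iterable[Action],
    score: Callable[[Action, State], float],
) -> Tuple[Action, float]:
    """Return the action with the highest expected score under `belief`."""
    def expected(action: Action) -> float:
        return sum(p * score(action, s) for s, p in belief.items())

    best = max(actions, key=expected)
    return best, expected(best)


# Hypothetical toy problem; all numbers are invented for illustration.
prior = {"delay": 0.2, "on_time": 0.8}
likelihood = {  # P(signal | state), e.g. what a display or AI aid shows
    "delay": {"warn": 0.9, "clear": 0.1},
    "on_time": {"warn": 0.3, "clear": 0.7},
}
actions = ["rebook", "keep"]
payoff = {
    ("rebook", "delay"): 0.0,
    ("rebook", "on_time"): -20.0,
    ("keep", "delay"): -100.0,
    ("keep", "on_time"): 0.0,
}


def score(action: Action, state: State) -> float:
    """Scoring rule: payoff of taking `action` when `state` is realized."""
    return payoff[(action, state)]


belief = posterior(prior, likelihood, "warn")
best, value = normative_decision(belief, actions, score)
print(best, round(value, 2))
```

In this sketch, each input is something participants would need to be told. If the prior, the signal model, or the scoring rule were withheld, neither `posterior` nor `normative_decision` could be evaluated, which is the sense in which an underspecified study leaves the normative benchmark undefined.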
