Benchmarking the rationality of AI decision making using the transitivity axiom
Kiwon Song, James M. Jennings, Clintin P. Davis-Stober
TL;DR
The paper asks whether AI decision making is rational by testing transitivity of preference, where rationality means consistency with a utility representation. It adapts choice experiments originally designed for humans to large language models and uses Bayesian model selection to compare two probabilistic models of transitive choice, weak stochastic transitivity and the mixture model of transitive preference (MMTP), against an unconstrained benchmark. Across ten versions of Meta's Llama 2 and 3 and several gamble formats, most models satisfy transitivity; violations appear only in the Chat/Instruct versions and are more pronounced under MMTP, which suggests that fine-tuning and interaction-focused prompts can affect rationality. The work shows that transitivity-based axioms can serve as practical benchmarks for AI outputs and contributes to the broader understanding of computational rationality in AI systems.
Abstract
Fundamental choice axioms, such as transitivity of preference, provide testable conditions for determining whether human decision making is rational, i.e., consistent with a utility representation. Recent work has demonstrated that AI systems trained on human data can exhibit reasoning biases similar to those of humans and that AI can, in turn, bias human judgments through AI recommendation systems. We evaluate the rationality of AI responses via a series of choice experiments designed to evaluate transitivity of preference in humans. We considered ten versions of Meta's Llama 2 and 3 LLMs. We applied Bayesian model selection to evaluate whether these AI-generated choices violated two prominent models of transitivity. We found that the Llama 2 and 3 models generally satisfied transitivity, and that when violations did occur, they occurred only in the Chat/Instruct versions of the LLMs. We argue that rationality axioms, such as transitivity of preference, can be useful for evaluating and benchmarking the quality of AI-generated responses and provide a foundation for understanding computational rationality in AI systems more generally.
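To make the transitivity condition concrete, the following minimal Python sketch checks weak stochastic transitivity on a table of pairwise choice probabilities: whenever a is chosen over b and b over c at least half the time, a must be chosen over c at least half the time. This is an illustration only and not the paper's procedure, which applies Bayesian model selection to observed choice data. The function name `satisfies_wst` and the example probabilities are hypothetical.

```python
from itertools import permutations


def satisfies_wst(p):
    """Check weak stochastic transitivity on pairwise choice probabilities.

    p maps an ordered pair (a, b) to the probability of choosing a over b.
    Only one direction per pair is needed; the reverse is taken as 1 - p.
    This is a deterministic check on given probabilities (an assumption of
    this sketch), not a statistical test on noisy choice counts.
    """
    items = sorted({x for pair in p for x in pair})

    def prob(a, b):
        # Look up P(a over b), falling back to the complement of P(b over a).
        if (a, b) in p:
            return p[(a, b)]
        return 1.0 - p[(b, a)]

    # Examine every ordered triple of distinct options.
    for a, b, c in permutations(items, 3):
        if prob(a, b) >= 0.5 and prob(b, c) >= 0.5 and prob(a, c) < 0.5:
            return False
    return True


# Hypothetical choice probabilities over three gambles A, B, C.
transitive = {("A", "B"): 0.8, ("B", "C"): 0.7, ("A", "C"): 0.9}
cyclic = {("A", "B"): 0.8, ("B", "C"): 0.7, ("A", "C"): 0.2}

print(satisfies_wst(transitive))
print(satisfies_wst(cyclic))
```

In the second table, A is usually chosen over B and B over C, yet C is usually chosen over A, which is the kind of preference cycle the transitivity axiom rules out.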
