Evaluating Binary Decision Biases in Large Language Models: Implications for Fair Agent-Based Financial Simulations

Alicia Vidler, Toby Walsh

TL;DR

The paper investigates whether state-of-the-art LLMs can generate fair binary decisions suitable for agent-based financial market simulations, focusing on uniformity and Markov independence across one-shot and few-shot sampling. By testing three GPT variants and varying temperature, it reveals substantial model- and version-specific biases, with few-shot sampling offering some improvement but often yielding non-Markovian sequences. Comparisons to Common Crawl data and human randomness expose training-data biases and partial alignment with human biases like Negative Recency, underscoring limitations for ABM integration. The findings highlight the need for careful model selection, bias mitigation, and method design when incorporating LLM-based decision making into financial simulations, and point to directions for future work in architecture analysis and cross-model validation.

Abstract

Large Language Models (LLMs) are increasingly being used to simulate human-like decision making in agent-based financial market models (ABMs). As models become more powerful and accessible, researchers can now incorporate individual LLM decisions into ABM environments. However, integration may introduce inherent biases that need careful evaluation. In this paper we test three state-of-the-art GPT models for bias using two model sampling approaches: one-shot and few-shot API queries. We observe significant variation in output distributions across specific models and model sub-versions, with GPT-4o-Mini-2024-07-18 showing notably better performance (32-43% yes responses) compared to GPT-4-0125-preview's extreme bias (98-99% yes responses). We show that sampling methods and model sub-versions significantly impact results: repeated independent API calls produce different distributions from batch sampling within a single call. While no current GPT model can simultaneously achieve a uniform distribution and Markovian properties in one-shot testing, few-shot sampling can approach uniform distributions under certain conditions. We explore the Temperature parameter, providing a definition and comparative results. We further compare our results to true random binary series and test specifically for the common human bias of Negative Recency, finding that LLMs have a mixed ability to 'beat' humans in this one regard. These findings emphasise the critical importance of careful LLM integration into ABMs for financial markets and more broadly.
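
To make the two properties named in the abstract concrete, the following is a minimal sketch of how a recorded sequence of yes/no answers could be checked for uniformity and for dependence between consecutive answers. It is not the authors' test procedure, which this page does not describe. The function names, the 1 = yes / 0 = no encoding, and the choice of a chi-square goodness-of-fit test and an alternation rate are all assumptions made for illustration.

```python
import math
from collections import Counter


def yes_share(seq):
    """Fraction of 'yes' answers (encoded as 1) in a binary sequence."""
    return sum(seq) / len(seq)


def chi_square_uniform(seq):
    """Chi-square goodness-of-fit test against a 50/50 split.

    Returns (statistic, p_value). With one degree of freedom the
    survival function is P(X > s) = erfc(sqrt(s / 2)).
    """
    n = len(seq)
    ones = sum(seq)
    zeros = n - ones
    expected = n / 2
    stat = ((ones - expected) ** 2 + (zeros - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def transition_counts(seq):
    """Count each (previous, next) pair of consecutive answers."""
    return Counter(zip(seq, seq[1:]))


def alternation_rate(seq):
    """Share of consecutive pairs that differ.

    An independent fair sequence alternates about half the time.
    A rate well above 0.5 is the pattern associated with Negative
    Recency (switching too often after a repeated outcome).
    """
    pairs = list(zip(seq, seq[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical data: a perfectly alternating sequence is uniform
    # in its marginal distribution but clearly not independent.
    answers = [1, 0] * 50
    print("yes share:", yes_share(answers))
    print("chi-square, p-value:", chi_square_uniform(answers))
    print("transitions:", dict(transition_counts(answers)))
    print("alternation rate:", alternation_rate(answers))
```

The alternating example shows why both checks are needed: a sequence can pass a uniformity test while failing any reasonable independence check, which mirrors the abstract's point that uniformity and Markovian behaviour are separate requirements.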

Paper Structure

This paper contains 23 sections, 5 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: 4o-Mini results for various Temperature settings show non-linear effects