Exploring the Sensitivity of LLMs' Decision-Making Capabilities: Insights from Prompt Variation and Hyperparameters

Manikanta Loya, Divya Anand Sinha, Richard Futrell

TL;DR

The paper addresses how LLM decision-making in a Horizon-like multi-armed bandit task depends on prompt design and hyperparameters. It reproduces and extends Binz and Schulz's experiments across three OpenAI models, systematically varying temperature and prompting strategies including Chain-of-Thought (CoT), Quasi-CoT, and CoT with hints. The findings show that prompt choice often dominates temperature effects: CoT prompting reduces regret, Quasi-CoT sometimes produces near-human exploration-exploitation dynamics, and prompts with hints can further boost performance to superhuman levels. These results argue for careful methodological controls in LLM psychology and highlight both the practical potential and the ethical considerations of steering LLM decision-making via prompt engineering.
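
The regret metric mentioned above can be illustrated with a minimal sketch. It assumes the common bandit definition (the best arm's mean reward minus the chosen arm's mean reward on each trial), which may differ in detail from the paper's exact computation. The function names and the `(arm_means, choices)` data layout are hypothetical and chosen only for illustration.

```python
import math
import statistics


def per_trial_regret(arm_means, choices):
    """Regret per trial: best arm's mean reward minus the chosen arm's mean reward.

    Assumption: this is the standard bandit definition, not necessarily
    the exact formula used in the paper.
    """
    best = max(arm_means)
    return [best - arm_means[c] for c in choices]


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean, sem


def regret_curve(games):
    """Aggregate regret across games, one (mean, sem) pair per trial index.

    games: list of (arm_means, choices) tuples, where all games have the
    same number of trials (hypothetical layout for this sketch).
    """
    regrets = [per_trial_regret(means, choices) for means, choices in games]
    return [mean_and_sem(list(column)) for column in zip(*regrets)]
```

A curve whose mean regret falls over trials corresponds to the "learning effect (negative slope)" that the figure captions refer to.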

Abstract

The advancement of Large Language Models (LLMs) has led to their widespread use across a broad spectrum of tasks, including decision making. Prior studies have compared the decision-making abilities of LLMs with those of humans from a psychological perspective. However, these studies have not always properly accounted for the sensitivity of LLMs' behavior to hyperparameters and variations in the prompt. In this study, we examine LLMs' performance on the Horizon decision-making task studied by Binz and Schulz (2023), analyzing how LLMs respond to variations in prompts and hyperparameters. By experimenting on three OpenAI language models possessing different capabilities, we observe that the decision-making abilities fluctuate based on the input prompts and temperature settings. Contrary to previous findings, language models display a human-like exploration-exploitation tradeoff after simple adjustments to the prompt.

Paper Structure (9 sections, 5 figures)

Figures (5)

  • Figure 1: Original Horizon 6 task prompt (Binz and Schulz, 2023).
  • Figure 2: Mean regret obtained in the Horizon (multi-trial multi-armed bandit) task by humans and LLMs with varying temperature, using the prompt from Binz and Schulz (2023). The solid black line indicates human performance; the other lines are LLMs. Error bars show the standard error of the mean.
  • Figure 3: Modifications to the prompt for the Horizon task. The Horizon 1 prompt is shown. For CoT, CoT-Exploit, and CoT-Explore, we explicitly ask the model to summarize its choice at the end by prepending "Answer the following question and summarize your choice at the end as 'Machine:[machine_name]'." to the entire prompt.
  • Figure 4: Mean regret obtained by humans and LLMs on the Horizon task with varying prompts. 'Quasi-CoT' means a prompt of the form 'Thinking step-by-step, I choose Machine …', which does not enable true chain-of-thought reasoning. The temperatures for GPT-2, GPT-3, and GPT-3.5 are 1.0, 0.5, and 1.0 respectively. These temperatures show the greatest learning effect (negative slope) in the Horizon 6 task.
  • Figure 5: gpt-3.5-turbo's behavior under different variants of CoT prompts at temperature 1.0.
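
The prompt variants described in the captions for Figures 3 and 4 can be sketched as follows. The two quoted strings come from those captions. Everything else is an assumption made for illustration: the function names, the variant labels, the placement of the Quasi-CoT phrase at the end of the prompt, and the regular expression used to read back the model's choice. The hint texts for CoT-Exploit and CoT-Explore are not given in the captions, so they are left out.

```python
import re

# Quoted in the Figure 3 caption; placed at the beginning of the prompt for CoT variants.
COT_INSTRUCTION = (
    "Answer the following question and summarize your choice at the end "
    "as 'Machine:[machine_name]'."
)

# Form quoted in the Figure 4 caption; the model is left to complete the machine name.
QUASI_COT_SUFFIX = "Thinking step-by-step, I choose Machine"


def build_prompt(task_prompt, variant):
    """Return the task prompt modified for one of the illustrated variants.

    The variant labels ("original", "cot", "quasi_cot") are hypothetical.
    """
    if variant == "original":
        return task_prompt
    if variant == "cot":
        return f"{COT_INSTRUCTION}\n\n{task_prompt}"
    if variant == "quasi_cot":
        # Assumption: the phrase is appended so the model continues it.
        return f"{task_prompt}\n{QUASI_COT_SUFFIX}"
    raise ValueError(f"unknown variant: {variant}")


def parse_choice(completion):
    """Extract the last 'Machine:<name>' summary from a CoT completion, or None."""
    matches = re.findall(r"Machine:\s*\[?(\w+)\]?", completion)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice for this sketch: a chain-of-thought completion may mention machines while reasoning, and the requested summary comes at the end.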