Table of Contents
Fetching ...

Can LLMs Capture Human Preferences?

Ali Goli, Amandeep Singh

TL;DR

It is demonstrated how prompting GPT to explain its decisions can mitigate, but does not eliminate, discrepancies between LLM and human responses, and chain-of-thought conjoint provides a structured framework for marketers to use LLMs to identify potential attributes or factors that can explain preference heterogeneity across different customers and contexts.

Abstract

We explore the viability of Large Language Models (LLMs), specifically OpenAI's GPT-3.5 and GPT-4, in emulating human survey respondents and eliciting preferences, with a focus on intertemporal choices. Leveraging the extensive literature on intertemporal discounting for benchmarking, we examine responses from LLMs across various languages and compare them to human responses, exploring preferences between smaller, sooner, and larger, later rewards. Our findings reveal that both GPT models demonstrate less patience than humans, with GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike human decision-makers. Though GPT-4 does not display lexicographic preferences, its measured discount rates are still considerably larger than those found in humans. Interestingly, GPT models show greater patience in languages with weak future tense references, such as German and Mandarin, aligning with existing literature that suggests a correlation between language structure and intertemporal preferences. We demonstrate how prompting GPT to explain its decisions, a procedure we term "chain-of-thought conjoint," can mitigate, but does not eliminate, discrepancies between LLM and human responses. While directly eliciting preferences using LLMs may yield misleading results, combining chain-of-thought conjoint with topic modeling aids in hypothesis generation, enabling researchers to explore the underpinnings of preferences. Chain-of-thought conjoint provides a structured framework for marketers to use LLMs to identify potential attributes or factors that can explain preference heterogeneity across different customers and contexts.

Can LLMs Capture Human Preferences?

TL;DR

It is demonstrated how prompting GPT to explain its decisions can mitigate, but does not eliminate, discrepancies between LLM and human responses, and chain-of-thought conjoint provides a structured framework for marketers to use LLMs to identify potential attributes or factors that can explain preference heterogeneity across different customers and contexts.

Abstract

We explore the viability of Large Language Models (LLMs), specifically OpenAI's GPT-3.5 and GPT-4, in emulating human survey respondents and eliciting preferences, with a focus on intertemporal choices. Leveraging the extensive literature on intertemporal discounting for benchmarking, we examine responses from LLMs across various languages and compare them to human responses, exploring preferences between smaller, sooner, and larger, later rewards. Our findings reveal that both GPT models demonstrate less patience than humans, with GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike human decision-makers. Though GPT-4 does not display lexicographic preferences, its measured discount rates are still considerably larger than those found in humans. Interestingly, GPT models show greater patience in languages with weak future tense references, such as German and Mandarin, aligning with existing literature that suggests a correlation between language structure and intertemporal preferences. We demonstrate how prompting GPT to explain its decisions, a procedure we term "chain-of-thought conjoint," can mitigate, but does not eliminate, discrepancies between LLM and human responses. While directly eliciting preferences using LLMs may yield misleading results, combining chain-of-thought conjoint with topic modeling aids in hypothesis generation, enabling researchers to explore the underpinnings of preferences. Chain-of-thought conjoint provides a structured framework for marketers to use LLMs to identify potential attributes or factors that can explain preference heterogeneity across different customers and contexts.
Paper Structure (12 sections, 6 equations, 9 figures, 6 tables)

This paper contains 12 sections, 6 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison of the standard prompting method without modifying AI history on the left, versus our approach that involves passing an edited AI history and the subsequent questions presented to GPT on the right.
  • Figure 2: Share of delayed reward choices across languages. The displayed intervals correspond to the 95% confidence intervals, clustered at the experimental cell (language-delay-interest) level. Languages with strong FTR are displayed in bold font and clustered on top.
  • Figure 3: Proportion of larger, delayed reward selection across different interest rate $(i)$ conditions. The displayed intervals correspond to the 95% confidence intervals, clustered at the level of experimental cells (language-delay-interest).
  • Figure 4: Estimates for $\delta$ across different languages using both standard (red circles) and chan-of-thought conjoint (blue triangles). The intervals are 95% confidence intervals. Languages with strong FTR are displayed in bold font and clustered on top.
  • Figure 5: An example response from GPT-4 Using Chain-of-Thought conjoint prompting.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2