Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

Rubing Li, João Sedoc, Arun Sundararajan

TL;DR

This paper addresses how to evaluate large language models beyond raw task performance by examining trusting behavior in a ten-round trust game. It compares OpenAI GPT-family models with DeepSeek variants under varied objectives and reasoning strategies, using instrumented prompts and reasoning transcripts to diagnose behavior. The findings show DeepSeek models exhibit more sophisticated trusting behavior, aided by forward planning and theory-of-mind, while some GPT-based models exhibit a collapse in trust under certain settings. The authors argue that benchmarks focusing only on intelligence or cost miss hidden fault lines, and they advocate standardized, economics-driven evaluation of LLMs for high-stakes applications.

Abstract

When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.

Paper Structure

This paper contains 10 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: "Leaderboard" summarizing performance. The numbers are the average fraction of the theoretical maximum profit across game rounds and experiment iterations. The winner has the highest distribution of profits. The ranking for each treatment is in parentheses, with (A) being the highest ranking, (B) the next highest, and so on. Outcomes whose distributions do not differ significantly at the 5% level are ranked the same.
  • Figure 2: Distribution of amount sent by the sender LLM under different treatment conditions. o1-mini and o3-mini did not respond to varying receiver trustworthiness for some treatments and the plots are almost perfectly overlaid.
  • Figure 3: Illustrates the trust game of Berg et al. (1995).
  • Figure 4: Comparison of GPT-4o-mini and o1-mini models across 10 rounds with the profit-maximizing objective, direct prompting, and a 50% returning receiver.
  • Figure 5: Comparison of direct prompting (no infused reasoning) with the infusion of zero-shot COT and self-consistency for the GPT-4o-mini sender agent, illustrating no significant changes in sender behavior.
  • ...and 2 more figures
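
The captions above refer to a ten-round trust game, a receiver who returns 50%, and profit reported as a fraction of the theoretical maximum. The following is a minimal sketch of how such a score could be computed, not the authors' implementation. The endowment of 10 and the multiplier of 3 follow the classic Berg et al. design and are assumptions here, as are the names `TrustGame` and `fraction_of_max` and the choice to take the maximum against a receiver with a fixed, known return fraction.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TrustGame:
    """Hypothetical parameters; the summary does not state the paper's values."""
    endowment: float = 10.0   # assumed per-round sender endowment
    multiplier: float = 3.0   # assumed factor applied to the amount sent
    rounds: int = 10          # the summary mentions a ten-round game

    def sender_payoff(self, sent: float, return_fraction: float) -> float:
        """Sender's payoff for one round against a fixed-fraction receiver."""
        if not 0.0 <= sent <= self.endowment:
            raise ValueError("sent must lie between 0 and the endowment")
        returned = return_fraction * self.multiplier * sent
        return self.endowment - sent + returned

    def max_sender_payoff(self, return_fraction: float) -> float:
        """Best one-round payoff; payoff is linear in `sent`, so check both ends."""
        return max(
            self.sender_payoff(0.0, return_fraction),
            self.sender_payoff(self.endowment, return_fraction),
        )


def fraction_of_max(
    game: TrustGame, sent_per_round: Sequence[float], return_fraction: float
) -> float:
    """Total sender payoff divided by the best achievable total payoff."""
    if len(sent_per_round) != game.rounds:
        raise ValueError("expected one amount per round")
    total = sum(game.sender_payoff(s, return_fraction) for s in sent_per_round)
    return total / (game.rounds * game.max_sender_payoff(return_fraction))


if __name__ == "__main__":
    game = TrustGame()
    # Against a 50% returning receiver, full trust is optimal under these assumptions.
    print(fraction_of_max(game, [10.0] * 10, 0.5))
    print(fraction_of_max(game, [0.0] * 10, 0.5))
```

Under these assumed parameters, a sender who never sends anything keeps the endowment each round but forgoes the surplus from trusting a 50% returning receiver, so its score falls below 1.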