Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

Rubing Li; João Sedoc; Arun Sundararajan

TL;DR

This paper addresses how to evaluate large language models beyond raw task performance by examining trusting behavior in a ten-round trust game. It compares OpenAI GPT-family models with DeepSeek variants under varied objectives and reasoning strategies, using instrumented prompts and reasoning transcripts to diagnose behavior. The findings show DeepSeek models exhibit more sophisticated trusting behavior, aided by forward planning and theory-of-mind, while some GPT-based models exhibit a collapse in trust under certain settings. The authors argue that benchmarks focusing only on intelligence or cost miss hidden fault lines, and they advocate standardized, economics-driven evaluation of LLMs for high-stakes applications.

Abstract

When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.
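As a rough illustration of the kind of game-theoretic trust model referred to above, the following minimal sketch simulates a repeated investment (trust) game between a trustor and a trustee. Only the ten-round horizon is taken from the summary above. The endowment of 10, the multiplier of 3, and all function and class names are assumptions made for illustration, not the parameters or code used in the experiments.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoundResult:
    sent: float             # amount the trustor chose to send
    returned: float         # amount the trustee sent back
    trustor_payoff: float   # trustor's earnings this round
    trustee_payoff: float   # trustee's earnings this round


def play_round(endowment: float, sent: float, return_fraction: float,
               multiplier: float = 3.0) -> RoundResult:
    """One round: the sent amount is multiplied, then partly returned.

    `multiplier=3.0` is an assumed, commonly used value, not one
    confirmed by the source.
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must lie between 0 and the endowment")
    if not 0 <= return_fraction <= 1:
        raise ValueError("return_fraction must lie between 0 and 1")
    received = sent * multiplier
    returned = received * return_fraction
    return RoundResult(
        sent=sent,
        returned=returned,
        trustor_payoff=endowment - sent + returned,
        trustee_payoff=received - returned,
    )


def play_game(send_policy: Callable[[List[RoundResult], float], float],
              return_policy: Callable[[List[RoundResult], float], float],
              rounds: int = 10, endowment: float = 10.0,
              multiplier: float = 3.0) -> List[RoundResult]:
    """Repeat the round; both policies can condition on the history."""
    history: List[RoundResult] = []
    for _ in range(rounds):
        sent = send_policy(history, endowment)
        fraction = return_policy(history, sent)
        history.append(play_round(endowment, sent, fraction, multiplier))
    return history


# Hypothetical policies: full trust versus no trust, against a trustee
# who always returns half of what it receives.
full_trust = play_game(lambda h, e: e, lambda h, s: 0.5)
no_trust = play_game(lambda h, e: 0.0, lambda h, s: 0.5)
print(sum(r.trustor_payoff for r in full_trust))
print(sum(r.trustor_payoff for r in no_trust))
```

Under these assumed parameters, a trustor that withholds everything keeps its endowment each round, while one that sends everything to a trustee returning half earns more. This is the sense in which a collapse in trust can be costly when the counterpart reciprocates.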
