The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Sherzod Hakimov, Roland Bernard, Tim Leiber, Karl Osswald, Kristina Richert, Ruilin Yang, Raffaella Bernardi, David Schlangen

TL;DR

This study addresses how Chain-of-Thought-like reasoning affects negotiation in large language models across English, German, and Italian. It introduces a multilingual, self-play evaluation across three dialogue games (Deal or No Deal, Clean Up, Air Balloon Survival) and measures performance, cost, and reasoning behavior. Key findings show that enabling reasoning improves negotiation performance for most models (e.g., GPT-5 improves by $31.4$ points) but incurs substantial compute costs (nearly $4\times$ higher). Open-weight models tend to reason in English rather than the task language, whereas commercial models generally maintain language-consistent reasoning. The results provide evidence of genuine strategic adaptation beyond pattern matching and offer guidance on deploying multilingual negotiating agents and balancing explainability with cost.

Abstract

Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (which may limit the explainability gains expected from disclosing reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.

Paper Structure

This paper contains 64 sections, 12 equations, 47 figures, 13 tables.

Figures (47)

  • Figure 1: An example of a Deal or No Deal episode that ends in a successful Pareto-optimal agreement.
  • Figure 2: An example episode of the Clean Up game. Players work towards a common goal configuration for a number of objects randomly placed on each player's grid and move them accordingly. Finally, both players have to agree that the goal has been reached to end the game.
  • Figure 3: An example episode of the Air Balloon Survival game. Two players must negotiate and argue for their preferred set of items, and must explicitly agree to a proposal made by the other.
  • Figure 4: Trade-off between performance and cost, averaged across languages.
  • Figure 5: Reasoning traces that display thinking loops (at least three repetitions of the same action) plotted
  • ...and 42 more figures