Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method

Tian Xia, Zhiwei He, Tong Ren, Yibo Miao, Zhuosheng Zhang, Yang Yang, Rui Wang

TL;DR

This work targets the evaluation of bargaining abilities in Large Language Model (LLM) agents by formalizing bargaining as an asymmetric incomplete-information game and introducing AmazonHistoryPrice, a real-product price dataset with 930 items across 18 categories. It defines a Bargaining benchmark with a Rubinstein-inspired process, two session types (Mutual Interest and Conflicting Interest), and metrics based on Normalized Profits, SNP, and Shares to quantify Buyer and Seller performance. The authors provide comprehensive experiments comparing multiple LLMs and reveal that Buyers underperform relative to Sellers, with larger models not substantially improving Buyer outcomes. To address this, they propose OG-Narrator, a simple, pipeline-based method that deterministically generates offers while an LLM Narrator renders natural language dialogue, dramatically boosting Buyer performance and even enabling unaligned models to bargain effectively; it also exposes vulnerabilities in ChatGPT as a Seller under adversarial bargaining. Overall, the work advances evaluation methodology for AI bargaining, demonstrates practical limitations of current models, and offers a scalable technique to enhance Buyer capabilities in price negotiation tasks with potential real-world impact for autonomous agents.

Abstract

Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents' bargaining abilities remains an open problem. For the first time, we formally describe the Bargaining task as an asymmetric incomplete information game, defining the gains of the Buyer and Seller in multiple bargaining processes. This allows us to quantitatively assess an agent's performance in the Bargaining task. We collect a real product price dataset, AmazonHistoryPrice, and evaluate the bargaining abilities of various LLM agents. We find that playing a Buyer is much harder than playing a Seller, and that increasing model size cannot effectively improve the Buyer's performance. To address this challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of the Buyer's offers and an LLM Narrator to create natural language sentences for the generated offers. Experimental results show that OG-Narrator improves the Buyer's deal rates from 26.67% to 88.88% and yields a tenfold increase in profits on all baselines, even for a model that has not been aligned.
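
The split between a deterministic Offer Generator and an LLM Narrator can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear concession schedule, the names `OfferGenerator`, `start_ratio`, `max_rounds`, and `narrate`, and the template string that stands in for an actual LLM call.

```python
from dataclasses import dataclass


@dataclass
class OfferGenerator:
    """Hypothetical deterministic schedule: offers rise linearly toward the budget."""
    budget: float
    start_ratio: float = 0.5   # assumed opening offer as a fraction of the budget
    max_rounds: int = 10       # assumed number of rounds before the budget is reached

    def offer(self, round_idx: int) -> float:
        # Clamp progress to [0, 1] so an offer never exceeds the budget.
        progress = min(max(round_idx, 0), self.max_rounds) / self.max_rounds
        ratio = self.start_ratio + (1.0 - self.start_ratio) * progress
        return round(self.budget * ratio, 2)


def narrate(offer: float, item: str) -> str:
    """Stand-in for the LLM Narrator: a fixed template instead of a model call."""
    return f"I can offer ${offer:.2f} for the {item}."


gen = OfferGenerator(budget=100.0)
offers = [gen.offer(r) for r in (0, 5, 10)]
message = narrate(offers[0], "oven")
```

The point of the design is that the price never passes through the language model: the Narrator only phrases a number that was already fixed, so the Buyer's offers stay inside a controlled range regardless of how the model behaves.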

Paper Structure

This paper contains 46 sections, 6 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: An example of the bargaining process. It is a simple case of two agents buying and selling an oven. Agents generate Thought, Talk and Action, where only the Talk and Action are transmitted to the other party, who responds with its own Talk and Action. The grey text indicates the exclusive information invisible to the other party: the Buyer's Budget and Thought are private, as are the Seller's Cost and Thought.
  • Figure 2: An overview of the diversity of our dataset AmazonHistoryPrice. The left figure shows the categories of all items in the dataset; the right figure shows the wide range of prices. All items are drawn from the categories of popular products on the camelcamelcamel website. The imbalanced distribution of categories reflects the real-world distribution of popular items in online shopping among human users.
  • Figure 3: Two types of bargain sessions. On the axis, the blue segment represents the range of $D$ that makes the Buyer's profit $P_b$ positive, while the red segment represents the range of $D$ that makes the Seller's profit $P_s$ positive. Assuming both parties are rational, the overlapping purple region indicates the feasible set of the bargaining problem, i.e., the set of all deal prices acceptable to both sides. Outside this region, at least one party should always reject the price.
  • Figure 4: The distribution of the proportions of Cost to List Price in our dataset. Sessions whose proportion is lower than $f$ are MI sessions, while those with a higher proportion are CI sessions.
  • Figure 5: The distribution of the Buyer's normalized profits over all sessions, when Mixtral-8x7B plays Buyer and ChatGPT plays Seller. The average of the normalized profits is slightly below zero. The red dashed line marks -0.5 and the blue line marks 0.5. They separate the Buyers who gain more than the Seller from those who do not, in CI and MI sessions.
  • ...and 3 more figures
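
The feasible set described in the Figure 3 caption can be made concrete with a short sketch. It assumes, as a simplification not stated in the captions above, that the Buyer's profit is Budget minus deal price and the Seller's profit is deal price minus Cost; the function names `feasible_set` and `profits` are hypothetical.

```python
from typing import Optional, Tuple


def profits(deal: float, cost: float, budget: float) -> Tuple[float, float]:
    """Return (Buyer profit, Seller profit) under the assumed definitions."""
    # Assumption: P_b = budget - deal and P_s = deal - cost.
    return (budget - deal, deal - cost)


def feasible_set(cost: float, budget: float) -> Optional[Tuple[float, float]]:
    """Open interval of deal prices where both profits are positive, or None."""
    # Both profits are positive exactly when cost < deal < budget.
    if cost < budget:
        return (cost, budget)
    return None


# Overlapping ranges: a deal at 70 leaves both parties with a positive profit.
overlap = feasible_set(cost=60.0, budget=80.0)
both_gain = profits(70.0, cost=60.0, budget=80.0)

# No overlap: any price that one party accepts is a loss for the other.
no_overlap = feasible_set(cost=90.0, budget=80.0)
both_lose = profits(85.0, cost=90.0, budget=80.0)
```

Under these assumptions, a session with a non-empty interval would presumably correspond to the Mutual Interest case and an empty one to the Conflicting Interest case.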