[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Jorge Carrasco Pollo; Ioannis Kapetangeorgis; Joshua Rosenthal; John Hua Yao

TL;DR

This paper examines how well a complex benchmark built on Scoreable Games evaluates LLM negotiation capabilities, with a focus on reproducibility and fairness across diverse models. It reproduces and extends the framework of Abdelnabi et al. (2024), introduces additional evaluation metrics (USW, ESW, NSW, IoU, sparsity), and applies leakage fixes and code corrections to improve interpretability. Experiments across multiple models and game variants show that model comparisons are fragile and sensitive to configuration choices, which casts doubt on the objectivity of the original benchmark claims. The findings point to the need for more diverse game setups and formal evaluation criteria, with practical implications for researchers and developers who want robust, generalizable negotiation benchmarks. The work also highlights the environmental and methodological considerations required for reproducible, open benchmarking of LLMs in social and strategic contexts.
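To make the welfare metrics named above concrete, here is a minimal sketch. It assumes that USW, ESW, and NSW follow the standard social-choice definitions (utilitarian: sum of scores, egalitarian: minimum score, Nash: product of scores); the paper's exact formulations and normalization may differ. The function names and the example scores are hypothetical and not taken from the paper or its code.

```python
import math


def utilitarian_welfare(scores):
    """USW (assumed definition): total score across all parties."""
    return sum(scores)


def egalitarian_welfare(scores):
    """ESW (assumed definition): score of the worst-off party."""
    return min(scores)


def nash_welfare(scores):
    """NSW (assumed definition): product of all parties' scores."""
    return math.prod(scores)


# Hypothetical per-party scores for a single agreed deal.
deal_scores = [60, 45, 80]

usw = utilitarian_welfare(deal_scores)  # 185
esw = egalitarian_welfare(deal_scores)  # 45
nsw = nash_welfare(deal_scores)         # 216000
```

Under these assumed definitions, USW rewards total value regardless of how it is split, ESW is driven only by the weakest party, and NSW falls sharply when any single party scores low, which is why the three can rank the same deal differently.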

Abstract

Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model-comparative evaluations.

Paper Structure

This paper contains 51 sections, 8 equations, 3 figures, 12 tables.

Introduction
Scope of reproducibility
Methodology
Implementation Details
Game Description
Game Variants
Setting Updates:
Score Function Updates:
Agent Behavior Updates
Game Evaluation
5/6-way:
6-way:
Any:
Leakage:
Wrong deals:
...and 36 more sections

Figures (3)

Figure 1: Evolution of the USW (blue), ESW (green), and NSW (red) metrics over time for two different models. The solid lines represent the mean score over time. The dashed colored lines represent the minimum and maximum achievable scores under a given metric. The black dashed line over USW is the linear least-squares regression fit.
Figure 2: Varying thresholds and number of players in the base game for GPT-4o mini.
Figure 3: Evolution of deals for several behavioral configurations. Environmental League is $p_i$; it acts greedily in the greedy experiment and adversarially in both the untargeted and targeted experiments. In the targeted experiment, Local Labour Union is the target.
