Reproducibility Study of Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

Jose L. Garcia, Karolina Hajkova, Maria Marchenko, Carlos Miguel Patiño

TL;DR

This work conducts a comprehensive reproducibility study of an LLM negotiation benchmark, testing open-weight models from 1.5B to 70B parameters alongside the cost-effective GPT-4o Mini. It introduces two key extensions, a Pareto-front analysis and a communication-free single-agent baseline, along with two new metrics: structure leakage and inequality (Gini coefficient). The results show that larger open-weight models can approach proprietary performance, while small models struggle with formatting and coherence. Notably, a single-agent baseline can match multi-agent results, which calls into question whether agent communication is necessary to succeed on the benchmark. The study also discusses environmental impact, accessibility, fairness, and privacy, and highlights limitations related to prompt sensitivity, recommending robust prompting and redesigned benchmark dynamics for more faithful assessments of negotiation skills.

Abstract

This paper presents a reproducibility study and extension of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation." We validate the original findings using a range of open-weight models (1.5B-70B parameters) and GPT-4o Mini while introducing several novel contributions. We analyze the Pareto front of the games, propose a communication-free baseline to test whether successful negotiations are possible without agent interaction, evaluate recent small language models' performance, analyze structural information leakage in model responses, and implement an inequality metric to assess negotiation fairness. Our results demonstrate that smaller models (<10B parameters) struggle with format adherence and coherent responses, but larger open-weight models can approach proprietary model performance. Additionally, in many scenarios, single-agent approaches can achieve comparable results to multi-agent negotiations, challenging assumptions about the necessity of agent communication to perform well on the benchmark. This work also provides insights into the accessibility, fairness, environmental impact, and privacy considerations of LLM-based negotiation systems.
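To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch of a Pareto-front filter and a Gini coefficient over per-party scores. It is an illustration under stated assumptions, not the authors' implementation: the function names `pareto_front` and `gini`, the representation of a deal as a tuple of per-party scores, and the sample numbers are all hypothetical, and the paper's exact Gini formulation may differ from the mean-absolute-difference form used here.

```python
from typing import List, Sequence, Tuple

Deal = Tuple[float, ...]  # hypothetical: one score per negotiating party


def pareto_front(deals: Sequence[Deal]) -> List[Deal]:
    """Return the deals that no other deal dominates."""

    def dominates(a: Deal, b: Deal) -> bool:
        # a dominates b if it is at least as good for every party
        # and strictly better for at least one party.
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    return [d for d in deals if not any(dominates(o, d) for o in deals)]


def gini(scores: Sequence[float]) -> float:
    """Gini coefficient of non-negative scores (0 means perfectly equal)."""
    n = len(scores)
    total = sum(scores)
    if n == 0 or total == 0:
        return 0.0
    # Mean absolute difference form:
    # G = sum_i sum_j |x_i - x_j| / (2 * n * sum_i x_i)
    diff_sum = sum(abs(a - b) for a in scores for b in scores)
    return diff_sum / (2 * n * total)


# Illustrative values only; these are not results from the paper.
sample_deals = [(60, 40), (50, 50), (40, 30), (60, 45)]
front = pareto_front(sample_deals)
equal_split = gini([50, 50])
one_sided = gini([0, 100])
```

In this sketch, a deal on the Pareto front cannot be improved for one party without hurting another, and a lower Gini value indicates that the final scores are spread more evenly across parties.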

Paper Structure

This paper contains 33 sections, 1 equation, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Performance improves for larger models when controlling for model provider and version (dotted line). Newer model versions also show performance gains, e.g., Llama 3.3 70B compared to Llama 3 70B. A more detailed analysis of the failure modes of smaller models and the impact of model size can be found in the section on small models.
  • Figure 2: Progression of $p_1$'s deals over rounds in the GPT-4o Mini experiments reported in the CoT ablations table. The results closely match those reported by Abdelnabi et al. (2024) for GPT-4.
  • Figure 3: Comparison of the single- and multi-agent approaches in the base negotiation game for GPT-4o Mini. The plots show similar results when we allow communication (pink) vs. no communication (green).
  • Figure 4: Example final answers of the $p_1$ agent using Qwen 2.5 32B (left) and Qwen 2.5 1.5B (right). The answer on the left shows how the agent fails to include a deal enclosed in <ANSWER> and </ANSWER> that specifically states the deal resulting from the negotiation. Such answers led to the experiments being classified as failed because the outcome of the negotiation could not be parsed.
  • Figure 5: Comparison between the single- and multi-agent approaches in the base negotiation game for Qwen 2.5 32B. The plots show similar results when we allow communication (pink) compared to no communication (green).
  • ...and 8 more figures