Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues

Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale M. Lucas, Jonathan Gratch

TL;DR

This work aims to systematically analyze the multifaceted capabilities of LLMs across diverse dialogue scenarios throughout the stages of a typical negotiation interaction, highlighting GPT-4's superior performance in many tasks while identifying specific challenges, such as making subjective assessments and generating contextually appropriate, strategically advantageous responses.

Abstract

A successful negotiation requires a range of capabilities, including comprehension of the conversation context, Theory-of-Mind (ToM) skills to infer the partner's motives, strategic reasoning, and effective communication, making it challenging for automated systems. Despite the remarkable performance of LLMs in various NLP tasks, there is no systematic evaluation of their capabilities in negotiation. Such an evaluation is critical for advancing AI negotiation agents and negotiation research, ranging from designing dialogue systems to providing pedagogical feedback and scaling up data collection practices. This work aims to systematically analyze the multifaceted capabilities of LLMs across diverse dialogue scenarios throughout the stages of a typical negotiation interaction. Our analysis highlights GPT-4's superior performance in many tasks while identifying specific challenges, such as making subjective assessments and generating contextually appropriate, strategically advantageous responses.

Paper Structure (33 sections, 10 figures, 17 tables)

Figures (10)

  • Figure 1: Overview of the key capabilities (C1-C4) required for a successful negotiation. We design tasks aligned with these abilities to assess how LLMs can advance different aspects of negotiation research. The negotiation scenario is based on the CaSiNo dataset (Chawla et al., 2021).
  • Figure 2: Our methodology for systematically evaluating LLMs in negotiation dialogues. Part A (top) describes the pipeline for creating task-specific prompts from a negotiation dataset and evaluating various LLMs with them. Part B (bottom) depicts the tasks categorized by Objectivity, Time Stage, and Task Type (Section \ref{sec:task-design}).
  • Figure 3: Overall results for zero-shot evaluation of LLMs. F1: macro F1 over all labels; PCC: Pearson Correlation Coefficient. Each bar shows the average result across all suitable tasks in the category. For example, as per (b), GPT-4 achieves $65.3\%$ Accuracy on average for Comprehension tasks in the End time stage. The tasks for these plots were selected to ensure a fair comparison: only tasks on which all models pass the generation validity checks (i.e., no null values for any model) are included. Details of the validity checks and the full results are in Table \ref{tab:full_results_all_models} of Appendix \ref{append:Details_of_Negotiation_Tasks}.
  • Figure 4: Confusion matrix of predictions of GPT-4 for the subjective task (end_partner_deal_likeness_ca). E stands for "Extremely", S for "Slightly", D for "Dislike" and L for "Like."
  • Figure 5: GPT-4’s evaluation on the end_deal_total_dnd task, highlighting the impact of Chain-of-Thought (CoT) prompting. Results for other tasks can be found in Figure \ref{fig:CoT_prompting_append} in the appendix.
  • ...and 5 more figures
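
The Figure 3 caption reports macro F1 and the Pearson Correlation Coefficient (PCC). As a rough illustration of how these two metrics are commonly computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function names `macro_f1` and `pearson` are hypothetical, the example labels and scores are made up, and taking the label set as the union of gold and predicted labels is an assumption.

```python
import math


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores.

    Assumption: the label set is the union of gold and predicted labels.
    """
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / (spread_x * spread_y)


# Made-up labels and scores, for illustration only.
gold_labels = ["like", "like", "dislike", "dislike"]
pred_labels = ["like", "dislike", "dislike", "dislike"]
print(round(macro_f1(gold_labels, pred_labels), 4))
print(round(pearson([1, 2, 3], [2, 4, 6]), 4))
```

Macro averaging gives every label the same weight regardless of how often it occurs, so a model that ignores a rare label is penalized more heavily than it would be under plain accuracy.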