
A Contextual Combinatorial Bandit Approach to Negotiation

Yexin Li, Zhancun Mu, Siyuan Qi

TL;DR

This work casts negotiation as a contextual combinatorial bandit problem to address exploration-exploitation under large action spaces and partial observations. It introduces NegUCB, a kernelized, full-bandit learning algorithm that utilizes hidden states and context via kernel regression to learn the acceptance function and optimize bids. The authors prove a sub-linear regret bound that is independent of bid cardinality and demonstrate strong empirical performance across multi-issue negotiation, resource allocation, and trading tasks, outperforming strong baselines. The approach offers a scalable, principled framework for learning negotiation strategies in complex, real-world settings with partial observability and nonlinear reward structures.

Abstract

Learning effective negotiation strategies poses two key challenges: the exploration-exploitation dilemma and large action spaces. However, learning-based approaches that effectively address these challenges in negotiation are lacking. This paper introduces a comprehensive formulation that covers a variety of negotiation problems. Our approach builds on contextual combinatorial multi-armed bandits: the bandit formulation resolves the exploration-exploitation dilemma, while its combinatorial structure handles large action spaces. Building on this formulation, we introduce NegUCB, a novel method that also handles common issues in negotiation such as partial observations and complex reward functions. NegUCB is contextual, is tailored to full-bandit feedback, and places no constraints on the reward function. Under mild assumptions, it guarantees a sub-linear regret upper bound. Experiments on three negotiation tasks demonstrate the superiority of our approach.
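As a rough illustration of the bandit view described above, the sketch below runs a generic GP-UCB-style kernel bandit over combinatorial bids. Each super-arm is a 0/1 vector over items, and only one scalar reward is observed per proposed bid (full-bandit feedback). This is not NegUCB: the RBF kernel, the exploration weight `beta`, the item values, and the simulated counterpart inside `run` are all hypothetical. The argmax also enumerates all $2^n$ bids, which is feasible only for tiny $n$.

```python
import itertools
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def ucb_scores(X_hist, y_hist, X_cand, lam=1.0, beta=2.0):
    """GP-UCB-style score: posterior mean + beta * posterior std per candidate."""
    K = rbf_kernel(X_hist, X_hist)
    A_inv = np.linalg.inv(K + lam * np.eye(len(X_hist)))
    k_c = rbf_kernel(X_cand, X_hist)                      # shape (m, t)
    mean = k_c @ A_inv @ y_hist
    # k(x, x) = 1 for the RBF kernel, so var = 1 - k^T (K + lam I)^{-1} k.
    var = 1.0 - np.einsum("ij,jk,ik->i", k_c, A_inv, k_c)
    return mean + beta * np.sqrt(np.maximum(var, 0.0))


def run(T=20, n_items=4, seed=0):
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.5, 1.5, n_items)               # hypothetical item values for agent a
    # Every super-arm: a 0/1 vector saying which items agent a asks for.
    bids = np.array(list(itertools.product([0, 1], repeat=n_items)), dtype=float)

    def observed_reward(b):
        # Hypothetical counterpart: greedier bids are accepted less often.
        p_accept = 1.0 / (1.0 + np.exp(3.0 * (b.sum() - n_items / 2)))
        return float(values @ b) * float(rng.random() < p_accept)

    X_hist, y_hist, picks = [], [], []
    for t in range(T):
        if t == 0:
            idx = int(rng.integers(len(bids)))            # no data yet: random first bid
        else:
            idx = int(np.argmax(ucb_scores(np.array(X_hist), np.array(y_hist), bids)))
        # Full-bandit feedback: one scalar reward for the whole bid, no per-item signal.
        X_hist.append(bids[idx])
        y_hist.append(observed_reward(bids[idx]))
        picks.append(idx)
    return bids, picks, y_hist
```

Calling `bids, picks, rewards = run()` returns the candidate set, the index of the bid proposed in each round, and the scalar rewards observed. Bids never tried keep a large posterior standard deviation, so the UCB score pushes the agent to explore them before settling on high-reward bids.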

Paper Structure (32 sections, 3 theorems, 31 equations, 9 figures, 2 tables, 1 algorithm)

Key Result

Lemma 3.3

Instead of learning the transformation function $\phi$ and the parameters $\boldsymbol{\Theta}$ and $\boldsymbol{U}$ and then estimating $r_{\tau + 1}(\boldsymbol{b})$ via eq:accept, one can equivalently iterate eq:a_theta and eq:d_u and then estimate $r_{\tau + 1}(\boldsymbol{b})$ using eq:exploit_by_kernel.
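Lemma 3.3 is an instance of the kernel trick. The paper's update equations (eq:a_theta, eq:d_u, eq:exploit_by_kernel) are not reproduced in this summary, so the sketch below only illustrates the generic identity behind such a result using kernel ridge regression. Fitting weights on an explicit feature map gives the same prediction as the kernel form $k(\boldsymbol{x})^\top(\boldsymbol{K} + \lambda \boldsymbol{I})^{-1}\boldsymbol{y}$, which never constructs the features. The degree-2 polynomial kernel and its explicit 2-D feature map `phi` are illustrative choices, not the paper's $\phi$.

```python
import numpy as np


def phi(X):
    """Hypothetical explicit feature map: phi(x) . phi(z) == (1 + x . z)^2 for 2-D inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    s = np.sqrt(2.0)
    return np.stack([np.ones_like(x1), s * x1, s * x2, x1**2, x2**2, s * x1 * x2], axis=1)


def poly_kernel(A, B):
    """Degree-2 polynomial kernel computed without building phi."""
    return (1.0 + A @ B.T) ** 2


def primal_predict(X, y, X_new, lam):
    """Learn weights in feature space: w = (Phi^T Phi + lam I)^{-1} Phi^T y, then phi(x)^T w."""
    P = phi(X)
    w = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
    return phi(X_new) @ w


def dual_predict(X, y, X_new, lam):
    """Kernel form: k(x)^T (K + lam I)^{-1} y with K = Phi Phi^T."""
    K = poly_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return poly_kernel(X_new, X) @ alpha
```

The two predictions agree because $(\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top = \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1}$ for $\lambda > 0$. The kernel form's cost scales with the number of observations rather than the feature dimension, which is what makes kernels with infinite-dimensional feature maps, such as the RBF kernel, usable.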

Figures (9)

  • Figure 1: Three typical types of negotiation. Negotiator $a$ is represented by the same icon in varying colors, indicating a single agent whose state evolves. Negotiator $g$ is depicted with distinct icons and colors, indicating different counterparts. (a) illustrates a trading task, where items in red are those that negotiator $a$ gives to $g$, and items in green are those that counterpart $g$ gives to $a$. (b) presents a resource allocation task: items in green are proposed for allocation to negotiator $a$, and those in red are proposed for assignment to negotiator $g$. (c) portrays a multi-issue negotiation task with two distinct issues, each offering several value choices; negotiators $a$ and $g$aim to agree on the values of both issues.
  • Figure 2: Acceptance function of a resource allocation task. (a) and (b) describe the contexts of the items and of the current negotiator pair, respectively. (c) illustrates how a bid is defined and how its bid context is extracted. (d) depicts the acceptance label. The goal is to approximate the acceptance function $\bar{r}: (\boldsymbol{Y}, \boldsymbol{x}, \boldsymbol{b}) \mapsto r$ from historical negotiation data.
  • Figure 3: Number of negotiation steps needed to reach a deal in each ANAC domain (domains 00 to 49).
  • Figure 4: Experimental results on the resource allocation task. Theoretical regret is the difference between the estimated $\bar{r}$ and the simulated $r$; acceptance regret is the difference between the estimated and simulated acceptance.
  • Figure 5: Acceptance rate on the resource allocation task, defined as the percentage of proposed bids that are accepted.
  • ...and 4 more figures
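Figure 2 defines the acceptance function as $\bar{r}: (\boldsymbol{Y}, \boldsymbol{x}, \boldsymbol{b}) \mapsto r$, and Figure 5 defines the acceptance rate as the percentage of proposed bids that are accepted. A minimal sketch of both for the resource allocation task follows. It assumes a 0/1 bid encoding (1 = item proposed for $a$, 0 = item proposed for $g$) and a hypothetical bid context that sums the item contexts on each side and appends the negotiator-pair context; the paper's actual feature extraction may differ. `bid_context` and `acceptance_rate` are illustrative names.

```python
import numpy as np


def bid_context(Y, x, b):
    """Hypothetical input vector for the acceptance model r_bar(Y, x, b).

    Y: (n_items, d) item contexts; x: (p,) negotiator-pair context;
    b: (n_items,) 0/1 vector, 1 = item proposed for agent a, 0 = for counterpart g.
    """
    b = np.asarray(b, dtype=float)
    to_a = Y.T @ b              # summed context of the items proposed for a
    to_g = Y.T @ (1.0 - b)      # summed context of the items proposed for g
    return np.concatenate([to_a, to_g, x])


def acceptance_rate(accepted):
    """Percentage of proposed bids that the counterpart accepted."""
    accepted = np.asarray(accepted, dtype=bool)
    return 100.0 * accepted.mean() if accepted.size else 0.0
```

For example, with three items whose contexts are the rows of `[[1, 0], [0, 2], [3, 1]]`, pair context `[0.5]`, and bid `[1, 0, 1]`, the bid context is `[4, 1, 0, 2, 0.5]`; three acceptances out of four proposals give a 75% acceptance rate.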

Theorems & Definitions (6)

  • Lemma 3.3
  • Lemma 3.4
  • Theorem 3.5
  • proof
  • proof
  • proof