Differentiable Normative Guidance for Nash Bargaining Solution Recovery

Moirangthem Tiken Singh, Surajit Borkotokey, Rajnish Kumar

Abstract

Autonomous artificial intelligence agents in negotiation systems must generate equitable utility allocations that satisfy individual rationality (IR), which ensures each agent receives at least its outside option, and the Nash Bargaining Solution (NBS), which maximizes joint surplus. Existing generative models often learn suboptimal human behaviors, producing solutions far from Pareto efficiency, while classical methods require full knowledge of the Pareto frontier, which is unavailable in real datasets. We propose a guided graph diffusion framework that generates individually rational utility vectors while approximating the NBS without frontier knowledge at inference time. Negotiations are modeled as directed graphs, with graph attention capturing asymmetric agent attributes, and a conditional diffusion model maps these graphs to utility vectors. A differentiable composite guidance loss, applied in the final reverse diffusion steps, penalizes IR violations and Nash product gaps. We prove that, under sufficient penalty weighting, solutions enter the IR region in finite time. Across datasets, the method achieves 100% IR compliance. Nash efficiency reaches 99.45% on synthetic data (within 0.55 percentage points of an oracle), 54.24% on CaSiNo, and 88.67% on Deal or No Deal, improving by 20-60 percentage points over unconstrained generative baselines.
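As a concrete sketch of the composite guidance objective described in the abstract, the fragment below combines a hinge penalty on IR violations with a log-surplus (Nash product) term and applies one gradient correction step. The weights `beta` and `lam`, the finite-difference gradient, and the step size are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def guidance_loss(u, d, beta=50.0, lam=1.0):
    """Composite guidance loss: IR hinge penalty plus Nash-product gap.

    u : candidate utility vector (one entry per agent)
    d : disagreement (outside-option) vector
    beta, lam : penalty weights (illustrative values)
    """
    ir_violation = np.maximum(d - u, 0.0)       # hinge on IR: u_i >= d_i
    ir_term = beta * np.sum(ir_violation ** 2)
    surplus = np.maximum(u - d, 1e-8)           # clamp to keep log finite
    nash_term = -lam * np.sum(np.log(surplus))  # maximize the Nash product
    return ir_term + nash_term

def guidance_grad(u, d, beta=50.0, lam=1.0, eps=1e-5):
    """Central finite-difference gradient of the guidance loss."""
    g = np.zeros_like(u)
    for i in range(len(u)):
        up, dn = u.copy(), u.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (guidance_loss(up, d, beta, lam)
                - guidance_loss(dn, d, beta, lam)) / (2 * eps)
    return g

# One guided correction step on a sample that violates IR for agent 0
u = np.array([0.10, 0.60])   # candidate allocation
d = np.array([0.25, 0.20])   # disagreement points
u_next = u - 0.01 * guidance_grad(u, d)
```

The IR penalty dominates for the violating coordinate, pulling it toward its disagreement point, while the log term nudges feasible coordinates toward larger surplus.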

Paper Structure

This paper contains 21 sections, 2 theorems, 16 equations, 6 figures, 7 tables, and 1 algorithm.

Key Result

Lemma 3.1

For any initial sample $\mathbf{u}_T \notin \mathcal{F}$, if the IR penalty weight satisfies $\beta > M + \sup_{\mathbf{u} \notin \mathcal{F},\, t} \|\mathbf{f}(\mathbf{u}, t)\|$, the guided trajectory enters $\mathcal{F}$ in finite time.
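The mechanism behind Lemma 3.1 can be illustrated with a toy one-dimensional dynamic (a sketch, not the paper's proof): a fixed adversarial drift `f` stands in for the bounded reverse-diffusion field, and a constant-magnitude push of weight `beta` plays the role of the IR penalty subgradient. Whenever `beta` exceeds the drift bound, each guided step makes strictly positive progress toward $\mathcal{F}$, so entry occurs after finitely many steps.

```python
import numpy as np

def steps_to_enter_ir(u0, d, f, beta, step=0.01, cap=10_000):
    """Iterate the guided update until the trajectory enters u >= d.

    f    : fixed adversarial drift standing in for the bounded model field
    beta : penalty weight scaling a unit push toward the IR region
    Returns (number of steps taken, final state).
    """
    u, steps = np.asarray(u0, dtype=float), 0
    while np.any(u < d) and steps < cap:
        push = beta * (u < d).astype(float)   # penalty direction, active outside F
        u = u + step * (f + push)
        steps += 1
    return steps, u

# beta = 2.0 exceeds the drift bound ||f|| = 0.9, mirroring the lemma's condition
steps, u = steps_to_enter_ir(u0=[-0.5], d=np.array([0.0]),
                             f=np.array([-0.9]), beta=2.0)
```

With these numbers each step gains 0.01 * (2.0 - 0.9) = 0.011 toward the feasible set, so the trajectory starting at -0.5 enters $u \geq 0$ after 46 steps, even against the worst-case drift.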

Figures (6)

  • Figure 1: System architecture for equitable utility generation. Strategic Graph Encoding: Agent feature vectors $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^k$, encoding negotiation attributes (disagreement points, resource constraints, preference weights), are processed by a Graph Attention Network v2 (GATv2) with $K$ attention heads over an $n$-node directed graph, producing a shared context embedding $h \in \mathbb{R}^{d_h}$. Conditional Multilayer Perceptron (MLP) Diffusion: At each reverse diffusion step, $h$ is concatenated with the noisy state $\mathbf{u}_t \in \mathbb{R}^n$ and sinusoidal time embedding $t_{\text{emb}} \in \mathbb{R}^{d_t}$ ($\omega_j = 10000^{-2j/D}$), forming a $d_z$-dimensional input ($d_z = n + d_t + d_h$) to the MLP denoiser $s_\theta$, which predicts noise $\hat{\boldsymbol{\varepsilon}}$. Normative Guidance: In the final $t_{\text{start}}$ fraction of denoising steps, the gradient $\nabla_{\hat{\mathbf{u}}_0} \mathcal{L}_{\text{guide}}$ steers $\hat{\mathbf{u}}_0$ toward Individual Rationality and NBS using the guidance loss.
  • Figure 2: Hyperparameter sensitivity profile for the Synthetic NTU dataset. Left: Spider chart ranking parameters by sensitivity score (range/mean averaged over four metrics), identifying $t_{\text{start}}$ and $\lambda$ as the dominant factors (${\approx}0.80$ and ${\approx}0.70$ respectively), while $\alpha$ and $\gamma$ are effectively saturated (${\leq}0.20$). Right: Spearman $\rho$ matrix quantifying directionality. $\lambda$ exhibits perfect monotone correlation with Nash Product ($\rho=+1.00$) and Nash Efficiency ($\rho=+1.00$), while $t_{\text{start}}$ is strongly negatively correlated ($\rho=-0.88$), confirming that earlier, stronger guidance is the primary driver of Nash-optimal allocation. The $\beta$ row exhibits a counterintuitive negative correlation ($\rho \approx -0.34$ with IR), attributable to gradient overshooting at excessive penalty magnitudes. Symbol--code name correspondence follows Table \ref{tab:notation}.
  • Figure 3: Joint 2-D grid search over $\lambda$ (lambda_guide, y-axis) and $t_{\text{start}}$ (guide_start_frac, x-axis) across four evaluation metrics (Synthetic NTU dataset). The dashed blue box marks the base configuration ($\lambda=0.03$, $t_{\text{start}}=0.30$); the solid box marks the composite-optimal configuration. Large $\lambda$ combined with early activation maximizes Nash Product and IR Compliance but increases Frontier Distance; small $\lambda$ with late activation preserves feasibility at the cost of Nash alignment, yielding efficiencies as low as $2.5\%$. No single cell dominates all four metrics simultaneously, confirming that a composite objective is required.
  • Figure 4: Visual evaluation of generated utility allocations across the three negotiation domains under optimized guidance configurations (Table \ref{tab:optimal_params}). Panel (a): Spatial distribution of generated utilities in $[0,1]^2$; the guided framework (green) actively repels from the disagreement points $\mathbf{d}$ (black crosses) while concentrating mass near the Pareto frontier arc. Panel (b): Nash product density; rightward shift indicates improved joint surplus and fairness. Panels (c) and (d): IR Compliance and Nash Efficiency distributions quantifying the reduction in axiomatic violations and recovery of operational optimality relative to the unguided baseline.
  • Figure 5: Aggregate trajectory statistics (mean $\pm 1$ std, $n=30$ test cases) for guided (green) and unguided (orange) DDIM chains across the three negotiation domains. The green shaded band indicates the active guidance window ($t/T < t_{\text{start}}$). Top (Synthetic NTU): IR Compliance (right panel) shows the unguided model suffering from late-stage cumulative drift, while the guided chain maintains strict $1.000$ compliance throughout the window. Middle (Deal or No Deal): Nash Product (left panel) diverges from the unguided baseline upon entering the guidance window, arresting joint surplus decay. Bottom (CaSiNo): Frontier Distance (center panel) shows synchronized convergence to zero for both modes, confirming that the MLP denoiser independently places samples near the Pareto arc; guidance provides directional (Nash-optimal) correction, not general feasibility correction.
  • ...and 1 more figure
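Figure 1 specifies the sinusoidal time embedding frequencies as $\omega_j = 10000^{-2j/D}$. The sketch below pairs sines and cosines in the standard transformer convention; that layout, and the dimension $d_t = 64$, are assumptions, since the caption does not fix them.

```python
import numpy as np

def time_embedding(t, dim):
    """Sinusoidal embedding of a diffusion timestep t into R^dim.

    Frequencies follow omega_j = 10000**(-2j/D) as in Figure 1; the
    sin/cos pairing is the standard transformer convention (assumed).
    """
    half = dim // 2
    freqs = 10000.0 ** (-2.0 * np.arange(half) / dim)  # omega_j for j = 0..D/2-1
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = time_embedding(t=25, dim=64)  # d_t = 64 chosen for illustration
```

The resulting vector is what Figure 1 denotes $t_{\text{emb}} \in \mathbb{R}^{d_t}$, concatenated with the noisy state and graph context before the MLP denoiser.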

Theorems & Definitions (4)

  • Lemma 3.1: Finite-Time Convergence to IR
  • Proof of Lemma 3.1
  • Theorem 3.2: Asymptotic Convergence to the NBS
  • Proof of Theorem 3.2