Risk-Sensitive Multi-Agent Reinforcement Learning in Network Aggregative Markov Games

Hafez Ghaemi, Hamed Kebriaei, Alireza Ramezani Moghaddam, Majid Nili Ahmadabadi

TL;DR

Experiments show that the subjective CPT policies learned by the proposed Distributed Nested CPT-AC algorithm can differ from risk-neutral ones, and that agents with higher loss aversion are more inclined to socially isolate themselves in a network aggregative Markov game (NAMG).

Abstract

Classical multi-agent reinforcement learning (MARL) assumes risk neutrality and complete objectivity for agents. However, in settings where agents need to consider or model human economic or social preferences, a notion of risk must be incorporated into the RL optimization problem. This is even more important in MARL, where other human or non-human agents are involved, possibly with their own risk-sensitive policies. In this work, we consider risk-sensitive and non-cooperative MARL with cumulative prospect theory (CPT), a non-convex risk measure and a generalization of coherent measures of risk. CPT is capable of explaining loss aversion in humans and their tendency to overestimate small probabilities and underestimate large ones. We propose a distributed sampling-based actor-critic (AC) algorithm with CPT risk for network aggregative Markov games (NAMGs), which we call Distributed Nested CPT-AC. Under a set of assumptions, we prove the convergence of the algorithm to a subjective notion of Markov perfect Nash equilibrium in NAMGs. The experimental results show that the subjective CPT policies obtained by our algorithm can differ from the risk-neutral ones, and that agents with higher loss aversion are more inclined to socially isolate themselves in an NAMG.
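
To make the "sampling-based" CPT idea concrete, the following is a minimal sketch of one standard quantile-based estimator of a CPT value from return samples. It is not taken from the paper and is not claimed to be the estimator used inside Distributed Nested CPT-AC. The function names (`weight`, `cpt_estimate`) are hypothetical, and the functional forms and default parameters are assumptions borrowed from the figure captions of this paper ($\gamma=\delta=0.69$, $\alpha=\beta=0.65$, $\lambda=2.6$).

```python
from typing import Sequence


def weight(p: float, g: float = 0.69) -> float:
    """Probability weighting p^g / (p^g + (1 - p)^g)^(1/g) (assumed form)."""
    num = p ** g
    return num / (num + (1.0 - p) ** g) ** (1.0 / g)


def cpt_estimate(
    samples: Sequence[float],
    g: float = 0.69,      # weighting exponent, shared by gains and losses here
    alpha: float = 0.65,  # gain utility exponent
    beta: float = 0.65,   # loss utility exponent
    lam: float = 2.6,     # loss aversion coefficient
) -> float:
    """Quantile-based CPT value estimate from i.i.d. return samples."""
    xs = sorted(samples)  # order statistics X_[1] <= ... <= X_[n]
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    gains = 0.0
    losses = 0.0
    for i, x in enumerate(xs, start=1):
        if x >= 0:
            # Gains are weighted by differences of the weighted upper tail.
            gains += x ** alpha * (weight((n + 1 - i) / n, g) - weight((n - i) / n, g))
        else:
            # Losses are weighted by differences of the weighted lower tail.
            losses += lam * (-x) ** beta * (weight(i / n, g) - weight((i - 1) / n, g))
    return gains - losses
```

With `g = alpha = beta = lam = 1` the estimator reduces to the ordinary sample mean, which is the risk-neutral case. With the default parameters, a symmetric gamble such as `[-1.0, 1.0]` receives a negative CPT value, reflecting loss aversion.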

Paper Structure (15 sections, 2 theorems, 33 equations, 5 figures, 1 algorithm)

This paper contains 15 sections, 2 theorems, 33 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1

(Nested CPT Policy Gradient) Given Assumption assumption:w, the gradient of the CPT return for agent $i$, $V^i_{\pi_{\theta}}(s_0)$, with respect to the policy parameter $\theta^i$ is [...] where $\phi$ and $u$ represent the CPT cumulative weighting and utility functions of the agent from eq:cptdef (superscript $i$ is dropped). The distribution $\mu_{cpt}^i$ is a subjective steady-state probability di...

Figures (5)

  • Figure 1: Conventional CPT weighting functions; $\omega^+(p) = \frac{p^{\gamma}}{(p^{\gamma} + (1-p)^{\gamma})^{(1/\gamma)}}$ and $\omega^-(p) = \frac{p^{\delta}}{(p^{\delta} + (1-p)^{\delta})^{(1/\delta)}}$ with $\gamma=\delta=0.69$.
  • Figure 2: Conventional CPT utility functions. The plot shows $u^+(x)=x^{\alpha}$ for $x\geq 0$ and $-u^-(x)=-\lambda(-x)^{\beta}$ for $x<0$, with $\alpha=\beta=0.65$ and $\lambda=2.6$.
  • Figure 3: A network aggregative Markov game
  • Figure 4: Smoothed mean value function of a given state over eight independent runs in Distributed Nested CPT-AC for scenario 2 (all agents are risk-sensitive with $\lambda=2.6$).
  • Figure 5: Mean converged policies over eight independent runs for different loss aversion scenarios. Scenario 1: all agents are risk-neutral, scenario 2: all agents are risk-sensitive ($\lambda=2.6$), scenario 3: only Agent 1 is risk-sensitive ($\lambda=2.6$), scenario 4: Agent 1 has a higher loss aversion coefficient ($\lambda=3.2$) than others ($\lambda=2.6$).
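
The weighting and utility functions quoted in the captions of Figures 1 and 2 can be written down directly. The sketch below is illustrative only: the function names `weight` and `utility` are hypothetical, and a single exponent is used for both $\omega^+$ and $\omega^-$ because the captions set $\gamma=\delta=0.69$.

```python
def weight(p: float, exponent: float = 0.69) -> float:
    """Probability weighting p^g / (p^g + (1 - p)^g)^(1/g) from the Figure 1 caption."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    num = p ** exponent
    return num / (num + (1.0 - p) ** exponent) ** (1.0 / exponent)


def utility(x: float, alpha: float = 0.65, beta: float = 0.65, lam: float = 2.6) -> float:
    """Signed utility from the Figure 2 caption.

    Gains:  u^+(x)  = x^alpha            for x >= 0
    Losses: -u^-(x) = -lam * (-x)^beta   for x < 0
    """
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** beta


if __name__ == "__main__":
    # Small probabilities are inflated, large ones deflated.
    print(weight(0.01), weight(0.99))
    # A unit loss weighs lam times as much as a unit gain.
    print(utility(1.0), utility(-1.0))
```

Under these parameters a small probability such as 0.01 is weighted above its objective value and a large one such as 0.99 below it, while a unit loss carries $\lambda = 2.6$ times the magnitude of a unit gain.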

Theorems & Definitions (7)

  • Remark 1
  • Remark 2
  • Theorem 1
  • proof
  • Remark 3
  • Remark 4
  • Theorem 2