Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Runyu Zhang, Na Li, Asuman Ozdaglar, Jeff Shamma, Gioele Zardini

TL;DR

The paper addresses risk sensitivity in multi-agent reinforcement learning by reframing optimism as controlled risk-seeking through convex risk measures. It develops optimistic value functions via a dual representation, derives a policy-gradient theorem for these functions (specializing to entropic risk with KL penalties), and proposes decentralized optimistic actor-critic algorithms. Theoretical results connect optimism to divergence-penalized risk evaluation, while empirical results on gridworlds and a cooperative ball-balancing task show improved coordination and robustness over risk-neutral and heuristic optimistic baselines. This work unifies risk-sensitive learning with optimism in MARL and offers a principled, practically effective framework for cooperative decision making under uncertainty.

Abstract

Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. Building on this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.
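The phrase "divergence-penalized risk-seeking evaluation" can be made concrete in the entropic risk/KL-penalty setting. The sketch below is not the paper's algorithm; it is a small numerical check of the standard Gibbs variational identity $\frac{1}{\beta}\log \mathbb{E}_{\mu}[e^{\beta f}] = \sup_{q}\{\mathbb{E}_{q}[f] - \frac{1}{\beta}\mathrm{KL}(q\,\|\,\mu)\}$, taken here as one illustrative form of an optimistic evaluation. The function names, the reference distribution `probs`, and the temperature `beta` are assumptions introduced for illustration, and `probs` is assumed to have full support.

```python
import numpy as np

def entropic_optimistic_value(values, probs, beta):
    """Risk-seeking entropic evaluation: (1/beta) * log E_probs[exp(beta * values)].

    Assumes beta > 0. The maximum is subtracted before exponentiating
    to avoid overflow.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    shift = values.max()
    return float(shift + np.log(np.sum(probs * np.exp(beta * (values - shift)))) / beta)

def kl_penalized_value(values, probs, q, beta):
    """Dual objective for a candidate distribution q: E_q[values] - (1/beta) * KL(q || probs).

    Assumes probs > 0 wherever q > 0.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = q > 0  # terms with q = 0 contribute nothing to the KL divergence
    kl = np.sum(q[mask] * np.log(q[mask] / probs[mask]))
    return float(np.dot(q, values) - kl / beta)

def tilted_distribution(values, probs, beta):
    """Maximizer of the dual objective: q*(a) proportional to probs(a) * exp(beta * values(a))."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    weights = probs * np.exp(beta * (values - values.max()))
    return weights / weights.sum()

if __name__ == "__main__":
    values = [1.0, 0.0, -1.0]      # hypothetical per-action returns
    probs = [1 / 3, 1 / 3, 1 / 3]  # hypothetical reference distribution
    beta = 2.0
    q_star = tilted_distribution(values, probs, beta)
    print("entropic value :", entropic_optimistic_value(values, probs, beta))
    print("dual at q*     :", kl_penalized_value(values, probs, q_star, beta))
    print("risk-neutral   :", float(np.dot(probs, values)))
```

The two printed evaluations should agree up to floating-point error, and both should be at least the risk-neutral mean. As `beta` shrinks toward zero, the KL penalty dominates and the optimistic value approaches the risk-neutral expectation.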

Paper Structure

This paper contains 15 sections, 6 theorems, 39 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

The function $\sigma:\mathcal{F}\to \mathbb{R}$ is a convex risk measure if and only if there exists a "penalty function" $D(\cdot): \Delta^{|\mathcal{A}|}\to \mathbb{R}$ such that $\sigma$ admits a dual representation in terms of $D$. Specifically, $D$ can be written explicitly in terms of $\sigma$.
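
For reference, one common way to write the dual representation, following the convention of Föllmer and Schied (2002) adapted to a finite action set, is sketched below. The sign convention and the symbol $\mu$ for a distribution in $\Delta^{|\mathcal{A}|}$ are assumptions; the paper's own statement may use a different but equivalent form.

$$
\sigma(f) = \sup_{\mu \in \Delta^{|\mathcal{A}|}} \Big( -\mathbb{E}_{a\sim\mu}[f(a)] - D(\mu) \Big),
\qquad
D(\mu) = \sup_{f \in \mathcal{F}} \Big( -\mathbb{E}_{a\sim\mu}[f(a)] - \sigma(f) \Big).
$$

Under this convention, $D$ is the convex conjugate of $\sigma$ evaluated at $-\mu$, which is why a penalty on deviations from a reference distribution corresponds to a particular choice of risk measure.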

Figures (5)

  • Figure 1: Reward table $R$
  • Figure 2: Risk-neutral
  • Figure 3: Optimistic
  • Figure 4: Cooperative Ball Balancing (Matignon et al., 2007)
  • Figure 5: Learning curves for different algorithms.

Theorems & Definitions (18)

  • Theorem 1: Dual Representation Theorem (Föllmer and Schied, 2002)
  • Example 1: Entropy risk measure (Föllmer and Schied, 2002)
  • Lemma 1: Bellman Equation
  • Proof
  • Theorem 2: Policy gradient theorem for optimistic value function
  • Remark 1: Interpretation
  • Lemma 2: Multi-agent optimistic policy gradient
  • Proof
  • Remark 2: Interpretation
  • Remark 3: Practical approximation
  • ...and 8 more