Cycles and collusion in congestion games under Q-learning

Cesare Carissimo, Jan Nagler, Heinrich Nax

TL;DR

The paper suggests a novel perspective on regulation and collusion, and reveals an important incentive incompatibility in the meta-game played by the designers of the individual Q-learners, who set their agents' parameters.

Abstract

We investigate the dynamics of Q-learning in a class of generalized Braess paradox games. These games represent an important class of network routing games where the associated stage-game Nash equilibria do not constitute social optima. We provide a full convergence analysis of Q-learning with varying parameters and learning rates. A wide range of phenomena emerges, broadly either settling into Nash or cycling continuously in ways reminiscent of "Edgeworth cycles" (i.e. jumping suddenly from Nash toward social optimum and then deteriorating gradually back to Nash). Our results reveal an important incentive incompatibility when thinking in terms of a meta-game being played by the designers of the individual Q-learners who set their agents' parameters. Indeed, Nash equilibria of the meta-game are characterized by heterogeneous parameters, and resulting outcomes achieve little to no cooperation beyond Nash. In conclusion, we suggest a novel perspective for thinking about regulation and collusion, and discuss the implications of our results for Bertrand oligopoly pricing games.

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The Braess Paradox network where three paths are possible: up, down and cross. Splitting 50% of the population up, and the other 50% down is socially optimal, but the Nash equilibrium has all agents picking cross. Numbers represent link costs, where "x" is the fraction of total agents $f(a)/n$.
  • Figure 2: 100 Q-learners in Braess Paradox $(\epsilon=0.01, \beta=0)$ display the characteristic behaviour of Edgeworth Cycles. Left: $\alpha=0.7$, a relatively fast learning rate which leads to cycles with short periods $L$, and high asymmetry $F$: steep regions approaching the one-shot Nash. Right: $\alpha=0.01$, a relatively slow learning rate which leads to cycles with long periods $L$, and less asymmetry $F$: flat regions approaching the one-shot Nash.
  • Figure 3: Correlation matrices between dependent and independent variables. Left: Correlations of $\alpha$, $\beta$ with $L$, $F$, $\langle C \rangle$, and $\sigma_{\langle C \rangle}$. Middle: Correlations of $\alpha$, when $\beta=0$, with $L$, $F$, $\langle C \rangle$, and $\sigma_{\langle C \rangle}$. Right: Correlations of $L$, $F$, $\langle C \rangle$, and $\sigma_{\langle C \rangle}$ with themselves. The variable $\sigma_{\langle C \rangle}$ is defined as the standard deviation of $\langle C \rangle$, $\sigma_{\langle C \rangle} = \sqrt{\langle C^2 \rangle - \langle C \rangle^2}$.
  • Figure 4: (Row 1) $\beta=0$ for all experiments, left: period of cycles as a function of $\alpha$, right: probability of increase of cycles as a function of $\alpha$ (the discontinuity near $\alpha=0$ is likely due to measurement error; see \ref{appendix:measurements}). (Row 2) Full ablation study of $\alpha$ and $\beta$, left: $\log$ of the cycle period for color rendition, right: probability of increase of cycles. (Row 3) Left: ablation study of $\alpha$ and $\beta$ for the time-averaged travel time, right: ablation of $\alpha$ setting $\beta=0$ for two initializations of $q$-values, random and Nash EQ.
  • Figure 5: Advantage $D_j$ (Eq. \ref{eq:advantage}) of [left] an agent which picks $\alpha$ (vertical) against a population of agents with fixed $\alpha$ (horizontal), [right] an agent which picks $\epsilon$ (vertical) against a population of agents with fixed $\epsilon$ (horizontal).
  • ...and 4 more figures
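
The setting described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the link costs are those of the textbook Braess network (two load-dependent links costing $x$, two fixed links costing 1, and a free shortcut), since the figure's actual numbers are not reproduced in this text; the learners are stateless, $\epsilon$-greedy, and update with learning rate $\alpha$ on the negative travel time; the parameter $\beta$ is left out because its role is not defined in this excerpt; and the random initial $q$-values are an arbitrary choice. The names `path_costs`, `average_cost`, and `simulate` are hypothetical.

```python
import random

ACTIONS = ("up", "down", "cross")


def path_costs(counts, n):
    """Travel time of each path, given how many of the n agents chose it.

    Assumed textbook Braess network (not taken from the paper's figure):
    'up' uses the first load-dependent link plus a fixed link of cost 1,
    'down' uses a fixed link of cost 1 plus the second load-dependent link,
    'cross' uses both load-dependent links and a free shortcut.
    """
    x_first = (counts["up"] + counts["cross"]) / n    # load shared by up and cross
    x_second = (counts["down"] + counts["cross"]) / n  # load shared by down and cross
    return {
        "up": x_first + 1.0,
        "down": 1.0 + x_second,
        "cross": x_first + x_second,
    }


def average_cost(counts, n):
    """Population-average travel time for a given split of agents."""
    costs = path_costs(counts, n)
    return sum(counts[a] * costs[a] for a in ACTIONS) / n


def simulate(n=100, alpha=0.7, epsilon=0.01, steps=2000, seed=0):
    """Run n stateless epsilon-greedy Q-learners and return average cost per step."""
    rng = random.Random(seed)
    # Arbitrary random initialization of q-values (an assumption of this sketch).
    q = [{a: rng.uniform(-2.0, -1.0) for a in ACTIONS} for _ in range(n)]
    history = []
    for _ in range(steps):
        choices = []
        for qi in q:
            if rng.random() < epsilon:
                choices.append(rng.choice(ACTIONS))        # explore
            else:
                choices.append(max(ACTIONS, key=qi.get))   # exploit
        counts = {a: choices.count(a) for a in ACTIONS}
        costs = path_costs(counts, n)
        for qi, a in zip(q, choices):
            # Reward is the negative travel time of the chosen path.
            qi[a] += alpha * (-costs[a] - qi[a])
        history.append(average_cost(counts, n))
    return history


if __name__ == "__main__":
    n = 100
    print("50/50 split:", average_cost({"up": 50, "down": 50, "cross": 0}, n))
    print("all cross:  ", average_cost({"up": 0, "down": 0, "cross": n}, n))
    trace = simulate()
    print("mean simulated cost:", sum(trace) / len(trace))
```

Under these assumed costs, the 50/50 split gives an average travel time of 1.5 and the all-cross profile gives 2, which matches the caption's statement that the split is socially optimal while the Nash equilibrium is worse. Whether the simulated trace cycles or settles depends on the parameters and on the omitted details, so the sketch should not be read as reproducing the paper's results.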