Simple Opinion Dynamics for No-Regret Learning

John Lazarsfeld, Dan Alistarh

TL;DR

For cooperative multi-agent bandits in the GOSSIP model with stationary rewards, it is proved for the first time that simple memoryless, time-independent opinion-dynamics protocols exhibit best-of-both-worlds behavior: they simultaneously obtain constant cumulative regret, scaling like $R(T)/T = \widetilde O(1/T)$, and reach consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds.

Abstract

We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.

Paper Structure (46 sections, 28 theorems, 87 equations, 1 figure)

This paper contains 46 sections, 28 theorems, 87 equations, 1 figure.

Key Result

Proposition 2.0

Consider running an adoption protocol with function $f$. Let $f(\mathbf{g}^t)$ denote the coordinate-wise application of $f$ on $\mathbf{g}^t$. Then $\mathbf{E}_t[p^{t+1}_j] = p^t_j ( 1 + f(g^t_j) - \langle \mathbf{p}^{t}, f(\mathbf{g}^t)\rangle )$ for every $t$ and $j$.
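
To make the update in Proposition 2.0 concrete, the following is a minimal sketch that evaluates its right-hand side for a single round. It assumes $\mathbf{p}^t$ is a probability vector over actions (the fraction of agents playing each action) and $\mathbf{g}^t$ is the vector of rewards; the function name `expected_update`, the example vectors, and the choice $f(x) = x/2$ are illustrative assumptions, not taken from the paper.

```python
def expected_update(p, g, f):
    """Right-hand side of Proposition 2.0 for every action j:
    p_j * (1 + f(g_j) - <p, f(g)>).
    """
    fg = [f(x) for x in g]                        # coordinate-wise f(g^t)
    avg = sum(pj * fj for pj, fj in zip(p, fg))   # inner product <p^t, f(g^t)>
    return [pj * (1.0 + fj - avg) for pj, fj in zip(p, fg)]


# Hypothetical example values (not from the paper).
p = [0.5, 0.3, 0.2]          # current distribution over three actions
g = [0.9, 0.5, 0.1]          # rewards of the three actions this round
f = lambda x: 0.5 * x        # assumed adoption function, for illustration only

q = expected_update(p, g, f)
print(q, sum(q))
```

Summing the formula over $j$ gives $1 + \langle \mathbf{p}^t, f(\mathbf{g}^t)\rangle - \langle \mathbf{p}^t, f(\mathbf{g}^t)\rangle = 1$, so the expected update is again a distribution. Actions whose $f(g^t_j)$ exceeds the population average $\langle \mathbf{p}^t, f(\mathbf{g}^t)\rangle$ gain mass in expectation, and the others lose it.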

Figures (1)

  • Figure 1: A depiction of the three-phase structure used in the proof of Theorem \ref{thm:stationary-regret}, and the corresponding cumulative regret bound of each phase. The dotted vertical line denotes the time at which best-action consensus is reached with high probability in Theorem \ref{thm:consensus}.

Theorems & Definitions (49)

  • Definition 1.1: Population-level Regret
  • Definition 1.2: Best-Action Consensus
  • Proposition 2.0
  • Definition 2.1: Zero-Sum MWU
  • Definition 2.1: Coupled Trajectories
  • Proposition 2.2
  • Proposition 2.2
  • Theorem 2.3: Regret of Zero-sum MWU
  • Proposition 2.3
  • Theorem 2.4
  • ...and 39 more