Simple Opinion Dynamics for No-Regret Learning

John Lazarsfeld; Dan Alistarh

TL;DR

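For stationary reward settings in a cooperative multi-agent bandit problem under the GOSSIP model, it is proved for the first time that simple memoryless, time-independent protocols inspired by opinion dynamics exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$ and reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds.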

Abstract

We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.

Paper Structure (46 sections, 28 theorems, 87 equations, 1 figure)

Introduction
Problem Setting
Algorithmic focus: learning via simple opinion dynamics
Best-action consensus
Related works with GOSSIP communication
Our Contributions
Summary of techniques
Structure of paper
Other Related Work
Multi-armed bandits and online learning
Opinion and consensus dynamics
Global dynamics from local protocols
Technical Overview
Notation and Other Preliminaries
Adoption and Comparison Protocols
...and 31 more sections

Key Result

Proposition 2.0

Consider running an adoption protocol with function $f$. Let $f(\mathbf{g}^t)$ denote the coordinate-wise application of $f$ on $\mathbf{g}^t$. Then $\mathbf{E}_t[p^{t+1}_j] = p^t_j ( 1 + f(g^t_j) - \langle \mathbf{p}^{t}, f(\mathbf{g}^t)\rangle )$ for every $t$ and $j$.
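
As a hedged illustration of the identity in Proposition 2.0, the following minimal sketch iterates the expected update $p_j \mapsto p_j\,(1 + f(g_j) - \langle \mathbf{p}, f(\mathbf{g})\rangle)$. The adoption function `f` (a linear map scaled by `beta`), the fixed reward vector `g`, and the uniform starting distribution are assumptions chosen only for illustration and are not taken from the paper. The sketch tracks the expectation only, not the randomness of an actual protocol run.

```python
import numpy as np

def expected_next_distribution(p, g, f):
    """Expected update from Proposition 2.0: p_j * (1 + f(g_j) - <p, f(g)>)."""
    p = np.asarray(p, dtype=float)
    fg = f(np.asarray(g, dtype=float))  # coordinate-wise application of f
    return p * (1.0 + fg - np.dot(p, fg))

# Hypothetical choices for illustration only (not from the paper):
beta = 0.5
f = lambda x: beta * x          # assumed adoption function with values in [0, 1]
g = np.array([0.2, 0.5, 0.9])   # assumed fixed reward vector, one entry per action
p0 = np.full(3, 1.0 / 3.0)      # uniform initial distribution over actions

# Iterate the expected update and keep the whole trajectory.
trajectory = [p0]
for _ in range(200):
    trajectory.append(expected_next_distribution(trajectory[-1], g, f))
p_final = trajectory[-1]

print(p_final)        # mass concentrates on the action with the largest reward
print(p_final.sum())  # stays (numerically) equal to 1
```

The update is zero-sum in the sense that the multiplicative corrections cancel on average: summing over $j$ gives $\sum_j p_j + \langle \mathbf{p}, f(\mathbf{g})\rangle - \langle \mathbf{p}, f(\mathbf{g})\rangle = 1$, so the expected update keeps $\mathbf{p}$ a probability distribution without explicit renormalization.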

Figures (1)

Figure 1: A depiction of the three-phase structure used in the proof of Theorem \ref{thm:stationary-regret}, and the corresponding cumulative regret bound of each phase. The dotted vertical line denotes the time at which best-action consensus is reached with high probability in Theorem \ref{thm:consensus}.

Theorems & Definitions (49)

Definition 1.1: Population-level Regret
Definition 1.2: Best-Action Consensus
Proposition 2.0
Definition 2.1: Zero-Sum MWU
Definition 2.1: Coupled Trajectories
Proposition 2.2
Proposition 2.2
Theorem 2.3: Regret of Zero-sum MWU
Proposition 2.3
Theorem 2.4
...and 39 more
