Competing Bandits: The Perils of Exploration Under Competition

Guy Aridor, Yishay Mansour, Aleksandrs Slivkins, Zhiwei Steven Wu

TL;DR

This paper investigates how competition among learning platforms interacts with exploration in multi-armed bandit problems. It develops a Bayesian-choice theory for a two-firm duopoly and complements it with extensive simulations in a reputation-based model to capture data-driven learning dynamics. The results reveal an inverted-U relationship: very intense competition suppresses innovation, while moderated competition or first-mover advantages promote adoption of better exploration algorithms and improve welfare. The analysis highlights data as a self-reinforcing asset that can create powerful barriers to entry and endogenize network effects in digital markets, offering policy-relevant insights into data regimes and competition in data-intensive platforms.

Abstract

Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents which choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a "greedy" bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some "free" users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the "competition vs. innovation" relationship, and elucidate the first-mover advantage in the digital economy.
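The duopoly dynamics described in the abstract can be sketched in a small simulation. This is only an illustrative sketch, not the authors' experimental code: the function name `simulate_duopoly`, the Bernoulli rewards, the Beta(1, 1) priors, and the rule that each user picks the firm with the higher sliding-window average reward (a simple "reputation") are all assumptions introduced here. The one feature taken directly from the abstract is that a firm updates its bandit state only when a user chooses it.

```python
import random


def simulate_duopoly(means, horizon, window=20, seed=0):
    """Toy duopoly: firm 0 plays a greedy rule, firm 1 plays Thompson sampling.

    Assumptions (not from the paper): Bernoulli rewards with the given means,
    Beta(1, 1) priors, and users who pick the firm with the higher average
    reward over its last `window` users, breaking ties uniformly at random.
    """
    rng = random.Random(seed)
    k = len(means)
    # post[firm][arm] = [alpha, beta] parameters of a Beta posterior.
    post = [[[1, 1] for _ in range(k)] for _ in range(2)]
    recent = [[], []]  # most recent rewards delivered by each firm
    users = [0, 0]     # number of users served by each firm
    total_reward = 0

    def reputation(firm):
        # Neutral score of 0.5 before the firm has served anyone.
        return sum(recent[firm]) / len(recent[firm]) if recent[firm] else 0.5

    for _ in range(horizon):
        r0, r1 = reputation(0), reputation(1)
        if r0 > r1:
            firm = 0
        elif r1 > r0:
            firm = 1
        else:
            firm = rng.randrange(2)

        if firm == 0:
            # Greedy: pull the arm with the highest posterior mean.
            arm = max(range(k),
                      key=lambda a: post[0][a][0] / (post[0][a][0] + post[0][a][1]))
        else:
            # Thompson sampling: pull the arm with the highest posterior sample.
            arm = max(range(k),
                      key=lambda a: rng.betavariate(post[1][a][0], post[1][a][1]))

        reward = 1 if rng.random() < means[arm] else 0
        # Only the chosen firm observes the reward and learns from it.
        post[firm][arm][0 if reward else 1] += 1
        recent[firm] = (recent[firm] + [reward])[-window:]
        users[firm] += 1
        total_reward += reward

    return users, total_reward
```

Calling `simulate_duopoly([0.3, 0.5, 0.7], horizon=500)` returns the number of users each firm served and the total reward users collected, which serves as a rough stand-in for market share and welfare under these assumptions.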

Paper Structure

This paper contains 34 sections, 20 theorems, 46 equations, 12 figures, 19 tables.

Key Result

Theorem 4.2

Assume $\mathtt{HardMax}$ response function with fair tie-breaking: $f_{\mathtt{resp}}(0)=1/2$. Assume that $\mathtt{alg}_{1}$ is $\mathtt{BayesGreedy}$, and $\mathtt{alg}_{2}$ deviates from $\mathtt{BayesGreedy}$ starting from some (local) step $n_0<T$. Then all agents in global rounds $t\ge$ ...
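The $\mathtt{HardMax}$ response function with fair tie-breaking can be sketched as follows. This is a minimal sketch under an assumption the excerpt does not spell out: that the argument of $f_{\mathtt{resp}}$ is the difference between the rewards an agent expects from firm 1 and from firm 2, and that the output is the probability of choosing firm 1. The names `hardmax_response` and `choose_firm` are hypothetical.

```python
import random


def hardmax_response(diff, tie=0.5):
    """Probability that an agent chooses firm 1.

    `diff` is assumed to be (expected reward at firm 1) minus
    (expected reward at firm 2). Fair tie-breaking means f_resp(0) = 1/2.
    """
    if diff > 0:
        return 1.0
    if diff < 0:
        return 0.0
    return tie


def choose_firm(diff, rng=random):
    """Sample the chosen firm (1 or 2) from the HardMax response."""
    return 1 if rng.random() < hardmax_response(diff) else 2
```

Under this rule the agent's choice is deterministic whenever the two firms differ at all, so even a tiny gap in expected reward moves every agent to the better-looking firm.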

Figures (12)

  • Figure 1: The stylized inverted-U relationship.
  • Figure 2: The models for $f_{\mathtt{resp}}$: $\mathtt{HardMax}$ is thick blue, $\mathtt{HardMax\&Random}$ is red, and $\mathtt{SoftMax}$ is dashed.
  • Figure 3: The stylized inverted-U relationship from the "secondary story".
  • Figure 4: Mean reputation trajectory (left) and mean instantaneous reward trajectory (right) for Needle-in-Haystack. The shaded area shows 95% confidence intervals. The shorthand for the algorithms is the same as in the main text: $\mathtt{BayesEpsilonGreedy}$ ($\mathtt{BEG}$), $\mathtt{BayesGreedy}$ ($\mathtt{BG}$), and $\mathtt{ThompsonSampling}$ ($\mathtt{TS}$).
  • Figure 5: Relative reputation trajectory for $\mathtt{ThompsonSampling}$ vs $\mathtt{BayesGreedy}$, on the Uniform instance (left) and the Needle-in-Haystack instance (right). The shaded area displays 95% confidence intervals. The relative reputation at time $t$ is the fraction of mean reward vectors for which, at time $t$, $\mathtt{ThompsonSampling}$ has a higher reputation score than $\mathtt{BayesGreedy}$.
  • ...and 7 more figures
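The relative reputation defined in the Figure 5 caption is a simple per-time-step fraction, which can be computed as in the sketch below. The input layout (one reputation trajectory per mean reward vector for each algorithm) and the name `relative_reputation` are assumptions made for illustration. Ties are counted as "not higher", following the caption's wording.

```python
def relative_reputation(ts_scores, bg_scores):
    """Fraction of mean reward vectors on which TS out-scores BG at each time.

    ts_scores[i][t] and bg_scores[i][t] are assumed to hold the reputation
    scores of ThompsonSampling and BayesGreedy at time t on the i-th mean
    reward vector. Returns a list with one fraction per time step.
    """
    n = len(ts_scores)
    horizon = len(ts_scores[0])
    return [
        sum(1 for i in range(n) if ts_scores[i][t] > bg_scores[i][t]) / n
        for t in range(horizon)
    ]
```

A value above 1/2 at time $t$ means that $\mathtt{ThompsonSampling}$ has the higher reputation on most of the sampled mean reward vectors at that time.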

Theorems & Definitions (39)

  • Definition 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Lemma 4.4
  • Lemma 4.5
  • Theorem 4.6
  • Proof Sketch
  • Theorem 4.7
  • Definition 4.8
  • Theorem 4.9
  • ...and 29 more