Dynamics of Learning under User Choice: Overspecialization and Peer-Model Probing

Adhyyan Narang, Sarah Dean, Lillian J Ratliff, Maryam Fazel

TL;DR

An algorithm is proposed that allows learners to "probe" the predictions of peer models, enabling them to learn about users who do not select them, inspired by the recent use of knowledge distillation in modern ML.

Abstract

In many economically relevant contexts where machine learning is deployed, multiple platforms obtain data from the same pool of users, each of whom selects the platform that best serves them. Prior work in this setting focuses exclusively on the "local" losses of learners on the distribution of data that they observe. We find that there exist instances where learners who use existing algorithms almost surely converge to models with arbitrarily poor global performance, even when models with low full-population loss exist. This happens through a feedback-induced mechanism, which we call the overspecialization trap: as learners optimize for users who already prefer them, they become less attractive to users outside this base, which further restricts the data they observe. Inspired by the recent use of knowledge distillation in modern ML, we propose an algorithm that allows learners to "probe" the predictions of peer models, enabling them to learn about users who do not select them. Our analysis characterizes when probing succeeds: this procedure converges almost surely to a stationary point with bounded full-population risk when probing sources are sufficiently informative, e.g., a known market leader or a majority of peers with good global performance. We verify our findings with semi-synthetic experiments on the MovieLens, Census, and Amazon Sentiment datasets.
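The feedback loop described above can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm or experimental setup: the two-group population, the linear models, the initial parameters, and the names `simulate`, `probe_weight`, and `sgd_step` are all hypothetical. Users pick the learner with the lowest loss on their own data, and each learner takes gradient steps only on the users who pick it. The probing step is a simplified stand-in: with probability `probe_weight`, learner 0 samples a user from the full population and fits the prediction of whichever learner that user currently prefers.

```python
import random

# Hypothetical toy population: two user groups with orthogonal features.
# The true label is 1.0 for everyone, so theta = (1, 1) has zero
# full-population loss.
GROUPS = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0)]

def predict(theta, x):
    return theta[0] * x[0] + theta[1] * x[1]

def loss(theta, x, y):
    return (predict(theta, x) - y) ** 2

def global_loss(theta):
    # Average loss over both groups (equal weights).
    return sum(loss(theta, x, y) for x, y in GROUPS) / len(GROUPS)

def sgd_step(theta, x, target, lr):
    # One gradient step on the squared error (theta . x - target)^2.
    err = predict(theta, x) - target
    return [t - lr * 2.0 * err * xi for t, xi in zip(theta, x)]

def simulate(probe_weight, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: each learner starts out good on one group
    # and bad on the other.
    thetas = [[0.5, -2.0], [-2.0, 0.5]]
    for _ in range(steps):
        # A user arrives and selects the learner with the lowest loss.
        x, y = rng.choice(GROUPS)
        chosen = min(range(2), key=lambda i: loss(thetas[i], x, y))
        # Only the selected learner observes this user's data.
        thetas[chosen] = sgd_step(thetas[chosen], x, y, lr)
        # Simplified probing (assumption): learner 0 queries the model
        # preferred by a freshly sampled user and fits its prediction.
        if rng.random() < probe_weight:
            xq, yq = rng.choice(GROUPS)
            peer = min(range(2), key=lambda i: loss(thetas[i], xq, yq))
            pseudo_label = predict(thetas[peer], xq)
            thetas[0] = sgd_step(thetas[0], xq, pseudo_label, lr)
    return [global_loss(t) for t in thetas]

if __name__ == "__main__":
    print("no probing:  ", simulate(probe_weight=0.0))
    print("with probing:", simulate(probe_weight=0.5))
```

In this toy, each learner without probing only ever receives gradients along the feature direction of its own users, so its parameter for the other group never moves and its full-population loss stays at about 4.5 even though a zero-loss model exists. With probing enabled, learner 0 also learns the other group's direction from its peer's predictions. The sketch only conveys the mechanism; the conditions under which probing succeeds are those analyzed in the paper.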

Paper Structure

This paper contains 32 sections, 8 theorems, 15 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Under the bounded-support assumption (label `ass:bounded_support` in the paper), for all $z \in \mathcal{Z}$, the loss $\ell(z, \cdot)$ is non-negative, convex, differentiable, locally Lipschitz, and $\beta_\ell$-smooth.
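For reference, the smoothness property in Lemma 1 can be written out as follows. This is the standard definition of $\beta_\ell$-smoothness, stated here with the Euclidean norm and a parameter symbol $\theta$ as assumptions; the paper may use a different norm or notation.

```latex
% Standard definition (assumed notation): gradients are beta_ell-Lipschitz.
\[
  \|\nabla_\theta \ell(z, \theta) - \nabla_\theta \ell(z, \theta')\|_2
  \;\le\; \beta_\ell \, \|\theta - \theta'\|_2
  \qquad \text{for all } \theta, \theta'.
\]
% A standard consequence: a quadratic upper bound on the loss.
\[
  \ell(z, \theta') \;\le\; \ell(z, \theta)
  + \langle \nabla_\theta \ell(z, \theta),\, \theta' - \theta \rangle
  + \frac{\beta_\ell}{2} \, \|\theta' - \theta\|_2^2 .
\]
```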

Figures (4)

  • Figure 1: Illustration of our online multi-learner problem setting. The borders of users represent their highest ranked learner $\pi(z)$. For further details, see the setup section (label `sec:setup` in the paper).
  • Figure 2: MSGD full-population performance with random initialization (Preference-aware scenario). Left: Census test accuracy. Middle: Amazon sentiment test accuracy. Right: MovieLens test loss. The dashed black line represents the performance of a baseline $\theta^\ast$ trained on the full dataset. In all cases, the hyperparameters $(\tau = 0.3, \lambda = 10^{-3})$ are used.
  • Figure 3: Effect of probing on full-population performance (Preference-aware scenario). Left: Census final accuracy vs. probing weight $p$. Middle: Amazon sentiment final accuracy vs. probing weight $p$. Right: MovieLens final loss vs. probing weight $p$. In all cases, triangle markers indicate the probing learner.
  • Figure 4: Performance of probing learner on Census as a function of $n$. Error bars show one standard deviation over 10 random seeds.

Theorems & Definitions (12)

  • Lemma 1
  • Definition 1: User Selection Rule
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 2: Globally good peers
  • Definition 3: Preference-aware probing
  • Lemma 3
  • Theorem 4
  • ...and 2 more