Diversity-Preserving K-Armed Bandits, Revisited

Hédi Hadiji, Sébastien Gerchinovitz, Jean-Michel Loubes, Gilles Stoltz

TL;DR

The work studies diversity-preserving K-armed bandits, in which the learner first picks a distribution over arms from a diversity set and the arm is then sampled from it, so that all arms receive exposure and polarization is avoided. It introduces a simple UCB-style strategy, DivP-UCB, that operates over the diversity set. For sub-Gaussian rewards on polytopes, the regret of this strategy is bounded when the optimal distributions place positive mass on every arm and logarithmic otherwise. A lower bound shows that $\ln T$ regret is unavoidable when some optimal mass is zero, at least when the model is mean-unbounded. The results extend beyond polytopes to curved sets: a ball-in-simplex example exhibits a $\ln^2 T$ rate, which highlights the role of curvature in improving rates relative to standard linear-bandit bounds. Together, the findings clarify when constant, logarithmic, or squared-log regret can be achieved under diversity constraints, and they connect to related work on mediator feedback and structured bandits, with implications for fair recommender systems.

Abstract

We consider the bandit-based framework for diversity-preserving recommendations introduced by Celis et al. (2019), who approached it in the case of a polytope mainly by a reduction to the setting of linear bandits. We design a UCB algorithm using the specific structure of the setting and show that it enjoys a bounded distribution-dependent regret in the natural cases when the optimal mixed actions put some probability mass on all actions (i.e., when diversity is desirable). The regret lower bounds provided show that otherwise, at least when the model is mean-unbounded, a $\ln T$ regret is suffered. We also discuss an example beyond the special case of polytopes.
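To make the setting concrete, the following is a minimal sketch of a UCB-style strategy over a polytope of distributions. It is an illustration under stated assumptions, not the exact strategy of the paper: the confidence bonus $\sqrt{2\sigma^2 \ln t / N_a}$ is a standard sub-Gaussian choice assumed here, the polytope is assumed to be given by a list of its vertices, rewards are simulated as Gaussian, and the names `ucb_indices`, `optimistic_value`, and `divp_ucb_sketch` are hypothetical. Since the optimistic objective is linear in the distribution, its maximum over a polytope is attained at a vertex, which is why the sketch only scans vertices.

```python
import math
import random


def ucb_indices(sums, counts, t, sigma2):
    """Per-arm upper confidence bounds; arms never observed get +inf."""
    indices = []
    for s, n in zip(sums, counts):
        if n == 0:
            indices.append(math.inf)
        else:
            # Assumed bonus: standard sub-Gaussian UCB width.
            bonus = math.sqrt(2.0 * sigma2 * math.log(t) / n)
            indices.append(s / n + bonus)
    return indices


def optimistic_value(p, indices):
    """Inner product <p, U>, skipping zero-mass arms to avoid 0 * inf."""
    return sum(p_a * u_a for p_a, u_a in zip(p, indices) if p_a > 0)


def divp_ucb_sketch(vertices, means, horizon, sigma2=1.0, seed=0):
    """Play the vertex maximizing <p, U_t>, then sample an arm from it.

    Returns the cumulative pseudo-regret (measured against the best
    distribution in the polytope) and the per-arm sample counts.
    """
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k
    counts = [0] * k

    def value(p):
        return sum(p_a * m for p_a, m in zip(p, means))

    best = max(value(p) for p in vertices)
    regret = 0.0
    for t in range(1, horizon + 1):
        indices = ucb_indices(sums, counts, t, sigma2)
        p = max(vertices, key=lambda v: optimistic_value(v, indices))
        arm = rng.choices(range(k), weights=p)[0]
        reward = rng.gauss(means[arm], math.sqrt(sigma2))
        sums[arm] += reward
        counts[arm] += 1
        regret += best - value(p)
    return regret, counts


# Hypothetical diversity set: every arm gets probability at least 0.1.
VERTICES = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.1, 0.1, 0.8)]
MEANS = [0.5, 0.2, 0.1]
```

Calling `divp_ucb_sketch(VERTICES, MEANS, horizon=2000)` returns the accumulated pseudo-regret and how often each arm was sampled. In this example every distribution in the set puts mass on all arms, so each arm keeps being observed no matter which vertex is played; this is the mechanism behind the bounded-regret regime described in the abstract.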

Paper Structure

This paper contains 49 sections, 12 theorems, 102 equations, 2 figures.

Key Result

Theorem 1

Let $\mathcal{P}$ be a polytope. Consider a sub-Gaussian model $\mathcal{D}$ with parameter $\sigma^2$, known and used by the diversity-preserving UCB strategy of Box B. The regret of the latter satisfies where $C_{\underline{\nu}}$ and $c_{\underline{\nu}}$ are quantities depending on $\underline{\nu}$ and whose general closed-form expressions may be read in eq:lnT-closed-form. In addition, for

Figures (2)

  • Figure 1: Estimated expected cumulative regret over time, in the case $\alpha = -0.1$ [top figure, bounded regret] and $\alpha = 0.1$ [bottom figure, $\ln T$ growth], for the two algorithms considered. Solid lines report empirical means while shaded areas correspond to $\pm 2$ standard errors of the series defining the empirical means.
  • Figure 2

Theorems & Definitions (28)

  • Definition 1
  • Example 1: Avoiding polarization
  • Example 2: One-shot version of bandits with knapsacks in the mechanism design
  • Definition 2
  • Definition 3
  • Theorem 1
  • Definition 4
  • Example 3
  • Definition 5: UFC strategies
  • Theorem 2
  • ...and 18 more