Combinatorial Allocation Bandits with Nonlinear Arm Utility

Yuki Shibukawa, Koichi Tanaka, Yuta Saito, Shinji Ito

TL;DR

A novel online learning problem, Combinatorial Allocation Bandits (CAB), that incorporates the notion of *arm satisfaction*, together with an upper confidence bound (UCB) algorithm whose approximate regret upper bound matches the existing lower bound in a special case.

Abstract

A matching platform is a system that matches different types of participants, such as companies and job-seekers. On such a platform, merely maximizing the number of matches can concentrate matches on highly popular participants, which may increase dissatisfaction among other participants, such as less popular companies, and ultimately lead to their churn, reducing the platform's profit opportunities. To address this issue, we propose a novel online learning problem, Combinatorial Allocation Bandits (CAB), which incorporates the notion of *arm satisfaction*. In CAB, at each round $t=1,\dots,T$, the learner observes $K$ feature vectors, corresponding to $K$ arms, for each of $N$ users, assigns each user to an arm, and then observes feedback following a generalized linear model (GLM). Unlike prior work, the learner's objective is not to maximize the amount of positive feedback but to maximize arm satisfaction. For CAB, we provide an upper confidence bound (UCB) algorithm whose approximate regret upper bound matches the existing lower bound in a special case. We also propose a Thompson sampling (TS) algorithm and provide an approximate regret upper bound for it. Finally, we conduct experiments on synthetic data to demonstrate the effectiveness of the proposed algorithms compared with other methods.
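
The interaction protocol described in the abstract can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the logistic link, the concave satisfaction `satisfaction(m) = m ** beta`, the greedy assignment step, the confidence-width scale `alpha`, and every function name below are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def satisfaction(matches, beta=0.5):
    # Hypothetical concave arm utility: diminishing returns in expected matches.
    return np.power(matches, beta)


def fit_logistic(X, y, lam=1.0, iters=25):
    # Ridge-regularized logistic maximum likelihood via Newton's method.
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + lam * theta
        hess = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta


def greedy_assign(probs, beta):
    # Assign users in order to the arm with the largest marginal gain in total satisfaction.
    n_users, n_arms = probs.shape
    load = np.zeros(n_arms)  # expected matches accumulated by each arm
    assign = np.empty(n_users, dtype=int)
    for i in range(n_users):
        gains = satisfaction(load + probs[i], beta) - satisfaction(load, beta)
        k = int(np.argmax(gains))
        assign[i] = k
        load[k] += probs[i, k]
    return assign


def run_cab_ucb(theta_star, T=50, N=5, K=4, beta=0.5, alpha=1.0, lam=1.0):
    d = theta_star.shape[0]
    X_hist, y_hist = np.zeros((0, d)), np.zeros(0)
    V = lam * np.eye(d)  # regularized Gram matrix of observed features
    realized = []
    for _ in range(T):
        feats = rng.normal(size=(N, K, d)) / np.sqrt(d)  # x_{t,i,k} for N users, K arms
        theta_hat = fit_logistic(X_hist, y_hist, lam)
        V_inv = np.linalg.inv(V)
        width = np.sqrt(np.einsum("nkd,de,nke->nk", feats, V_inv, feats))
        ucb = sigmoid(feats @ theta_hat + alpha * width)  # optimistic match probabilities
        assign = greedy_assign(ucb, beta)
        x = feats[np.arange(N), assign]  # features of the chosen (user, arm) pairs
        y = rng.binomial(1, sigmoid(x @ theta_star))  # GLM (logistic) feedback
        X_hist = np.vstack([X_hist, x])
        y_hist = np.concatenate([y_hist, y])
        V += x.T @ x
        # Expected satisfaction of the chosen assignment under the true parameter.
        p_true = sigmoid(x @ theta_star)
        load = np.bincount(assign, weights=p_true, minlength=K)
        realized.append(satisfaction(load, beta).sum())
    return np.array(realized), theta_hat
```

With `beta = 1.0` the utility reduces to the expected number of matches, and `greedy_assign` sends both users in `[[0.9, 0.8], [0.9, 0.8]]` to arm 0. With `beta = 0.5` the second user goes to arm 1 instead, a small-scale version of the concentration effect that arm satisfaction is meant to counteract.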

Paper Structure

This paper contains 36 sections, 16 theorems, 56 equations, 4 figures, 1 table, 5 algorithms.

Key Result

Lemma 2.1

There is a $(1-1/e)$-approximation algorithm for the submodular welfare problem with monotone submodular utility functions under the value oracle model.
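
To make the value oracle model concrete, the sketch below uses a simpler technique than the one the lemma refers to: plain greedy allocation, which queries each agent's utility only through set evaluations and is known to give a $1/2$-approximation for monotone submodular utilities. It is not the $(1-1/e)$ algorithm and not the paper's method. The concave utility `concave_utility`, the function names, and the small instance are hypothetical illustrations.

```python
import itertools
import math
from typing import Callable, FrozenSet, List, Sequence

Utility = Callable[[FrozenSet[int]], float]  # value oracle: set of items -> value


def greedy_welfare(items: Sequence[int], utilities: Sequence[Utility]) -> List[FrozenSet[int]]:
    # Give each item to the agent whose value increases the most (oracle queries only).
    bundles = [frozenset() for _ in utilities]
    for j in items:
        gains = [u(b | {j}) - u(b) for u, b in zip(utilities, bundles)]
        a = max(range(len(utilities)), key=gains.__getitem__)
        bundles[a] = bundles[a] | {j}
    return bundles


def welfare(bundles: Sequence[FrozenSet[int]], utilities: Sequence[Utility]) -> float:
    return sum(u(b) for u, b in zip(utilities, bundles))


def brute_force_welfare(items: Sequence[int], utilities: Sequence[Utility]) -> float:
    # Exhaustive search over all item-to-agent assignments (tiny instances only).
    best = 0.0
    for owner in itertools.product(range(len(utilities)), repeat=len(items)):
        bundles = [frozenset(j for j, o in zip(items, owner) if o == a)
                   for a in range(len(utilities))]
        best = max(best, welfare(bundles, utilities))
    return best


def concave_utility(weights: Sequence[float], beta: float = 0.5) -> Utility:
    # Concave function of a nonnegative additive score, hence monotone submodular.
    return lambda s: math.pow(sum(weights[j] for j in s), beta)
```

Read in CAB terms, items play the role of users and agents the role of arms, so `concave_utility` makes each arm's value a concave function of its total match weight.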

Figures (4)

  • Figure 1: Schematic comparison of the matches obtained by maximizing the number of matches with the desired matches obtained by maximizing satisfaction. Arm A is assumed to be the most popular firm, and popularity decreases toward arm D.
  • Figure 2: Comparison of cumulative satisfaction and cumulative matches (a) over time steps, (b) with varying satisfaction parameter $\beta$, and (c) with varying arm popularity parameter $\lambda$. Results in (b) and (c) are normalized by those of the optimal algorithm. Panel (d) shows the empirical selection probability of each arm under each method, and panel (e) shows the average sum of expected matches over the last 10 steps.
  • Figure 3: Comparison of cumulative satisfaction (our objective) and cumulative matches (the typical objective) as the number of arms varies.
  • Figure 4: Comparison of cumulative satisfaction (our objective) and cumulative matches (the typical objective) as $\gamma$ in FairX varies.

Theorems & Definitions (29)

  • Lemma 2.1: STOC2008submodular_welfare
  • Theorem 4.1
  • Theorem 4.2
  • Theorem C.1
  • proof
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Lemma D.3: takemura2021near_combinatorial
  • ...and 19 more