Combinatorial Allocation Bandits with Nonlinear Arm Utility

Yuki Shibukawa; Koichi Tanaka; Yuta Saito; Shinji Ito

TL;DR

This paper introduces Combinatorial Allocation Bandits (CAB), a novel online learning problem that incorporates the notion of *arm satisfaction*, and provides an upper confidence bound algorithm whose approximate regret upper bound matches the existing lower bound in a special case.

Abstract

A matching platform is a system that matches different types of participants, such as companies and job-seekers. In such a platform, merely maximizing the number of matches can concentrate matches on highly popular participants, which may increase dissatisfaction among other participants, such as companies, and ultimately lead to their churn, reducing the platform's profit opportunities. To address this issue, we propose a novel online learning problem, Combinatorial Allocation Bandits (CAB), which incorporates the notion of *arm satisfaction*. In CAB, at each round $t=1,\dots,T$, the learner observes $K$ feature vectors corresponding to $K$ arms for each of $N$ users, assigns each user to an arm, and then observes feedback following a generalized linear model (GLM). Unlike prior work, the learner's objective is not to maximize the amount of positive feedback, but rather to maximize the arm satisfaction. For CAB, we provide an upper confidence bound algorithm that achieves an approximate regret upper bound, which matches the existing lower bound in a special case. Furthermore, we propose a Thompson sampling (TS) algorithm and provide an approximate regret upper bound for it. Finally, we conduct experiments on synthetic data to demonstrate the effectiveness of the proposed algorithms compared to other methods.
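The round structure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the logistic link as the GLM, the square root as a stand-in for a nonlinear (concave) arm utility, and the hypothetical names `play_round` and `total_satisfaction`. The paper's actual satisfaction function may differ.

```python
import math
import random


def sigmoid(z):
    """Logistic link, assumed here as one example of a GLM link function."""
    return 1.0 / (1.0 + math.exp(-z))


def play_round(features, theta, assignment, rng):
    """Simulate one round of feedback.

    features[u][a] is the feature vector for user u and arm a,
    theta is the (hypothetical) true parameter vector,
    assignment[u] is the arm chosen for user u.
    Returns the number of positive feedback events per arm.
    """
    n_arms = len(features[0])
    matches = [0] * n_arms
    for u, a in enumerate(assignment):
        score = sum(x * w for x, w in zip(features[u][a], theta))
        if rng.random() < sigmoid(score):
            matches[a] += 1
    return matches


def total_satisfaction(matches, utility=math.sqrt):
    """Sum a concave per-arm utility of match counts (sqrt is an assumption)."""
    return sum(utility(m) for m in matches)


# Same number of matches, different spread across four arms.
concentrated = total_satisfaction([4, 0, 0, 0])  # 2.0
spread = total_satisfaction([1, 1, 1, 1])        # 4.0
```

Under this assumed concave utility, four matches spread over four arms score higher than four matches concentrated on one arm, which is the kind of effect the abstract attributes to arm satisfaction as opposed to a raw match count.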

Paper Structure (36 sections, 16 theorems, 56 equations, 4 figures, 1 table, 5 algorithms)

Introduction
Our Contributions
Technical Challenges
Preliminaries
Notations
Generalized Linear Models
Submodular Welfare Problem
Combinatorial Allocation Bandits
Problem Setting
Algorithm and Theoretical Results
Upper Confidence Bound Algorithm
Regret Analysis
Thompson Sampling Algorithm
Regret Analysis
Implementation of Algorithms
...and 21 more sections

Key Result

Lemma 2.1

Under the value oracle model, there is a $(1-1/e)$-approximation algorithm for the submodular welfare problem when the utility functions are monotone submodular.
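To make the value oracle model concrete, here is a minimal sketch of the simple greedy allocation for the submodular welfare problem. This is not the $(1-1/e)$-approximation algorithm the lemma refers to; greedy is a simpler baseline that is known to give a $1/2$-approximation for monotone submodular utilities. The names `greedy_welfare` and `sqrt_utility`, and the example weights, are hypothetical.

```python
import math


def greedy_welfare(items, utilities):
    """Assign each item to the agent with the largest marginal gain.

    utilities[i] is a value oracle: a callable that maps a frozenset of
    items to agent i's utility for that bundle.
    Returns one bundle (a set of items) per agent.
    """
    bundles = [set() for _ in utilities]
    for item in items:
        best_agent, best_gain = 0, -math.inf
        for i, f in enumerate(utilities):
            current = frozenset(bundles[i])
            gain = f(current | {item}) - f(current)
            if gain > best_gain:
                best_agent, best_gain = i, gain
        bundles[best_agent].add(item)
    return bundles


def sqrt_utility(weights):
    """Monotone submodular example: sqrt of a sum of nonnegative weights."""
    return lambda bundle: math.sqrt(sum(weights[j] for j in bundle))


# Two agents, four items; agent 0 values every item more than agent 1 does.
utilities = [sqrt_utility([4, 4, 4, 4]), sqrt_utility([1, 1, 1, 1])]
bundles = greedy_welfare(range(4), utilities)
welfare = sum(f(frozenset(b)) for f, b in zip(utilities, bundles))
```

In this example the diminishing returns of agent 0's utility cause greedy to give one item to agent 1 even though agent 0 values every individual item more.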

Figures (4)

Figure 1: This figure provides a schematic illustration comparing the results obtained by maximizing the number of matches with the desired matches obtained using satisfaction. We assume that arm A is the most popular firm, and that popularity decreases toward arm D.
Figure 2: Comparison of cumulative satisfaction and matches (a) at each time step, and for varying (b) satisfaction parameters ($\beta$) and (c) arm popularity parameters ($\lambda$). The results in (b) and (c) are normalized by those of the optimal algorithm. Panel (d) shows the empirical selection probability of each arm under each method, and panel (e) shows the average sum of expected matches over the last 10 steps.
Figure 3: Comparison of cumulative satisfaction (our objective) and matches (the typical objective) as the number of arms varies.
Figure 4: Comparison of cumulative satisfaction (our objective) and matches (the typical objective) as $\gamma$ in FairX varies.

Theorems & Definitions (29)

Lemma 2.1: STOC2008submodular_welfare
Theorem 4.1
Theorem 4.2
Theorem C.1
proof
Lemma D.1
proof
Lemma D.2
proof
Lemma D.3: takemura2021near_combinatorial
...and 19 more
