Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Joongkyu Lee; Min-hwan Oh

TL;DR

The paper tackles contextual MNL bandits with assortments of size up to $K$, revealing minimax regret bounds that depend on the outside option attraction $v_0$ and the reward structure. It introduces OFU-MNL+, a constant-time algorithm based on online mirror descent and optimistic revenue, achieving near-optimal regret in the uniform setting: $\tilde{\mathcal{O}}\left( \frac{\sqrt{v_0 K}}{v_0+K} d \sqrt{T} \right)$ (plus a $d^2/\kappa$ term), which collapses to $\tilde{\mathcal{O}}(d \sqrt{T/K})$ when $v_0=\Theta(1)$, and to $\tilde{\mathcal{O}}(d \sqrt{T})$ when $v_0=\Theta(K)$. For non-uniform rewards, it proves a matching lower bound of $\Omega(d \sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d \sqrt{T})$, both attainable by OFU-MNL+. The work also provides instance-dependent bounds under uniform rewards and demonstrates empirical performance improvements over existing MNL bandit algorithms. Overall, it delivers the first nearly minimax-optimal and computationally efficient contextual MNL bandit framework, with clear guidance on how $K$ and $v_0$ influence regret.
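The two special cases quoted above can be checked by substituting into the factor $\frac{\sqrt{v_0 K}}{v_0+K}$. The following is a short sanity check rather than the paper's derivation; it takes $v_0 = 1$ and $v_0 = K$ as representatives of $\Theta(1)$ and $\Theta(K)$ and ignores constants, logarithmic factors, and the $d^2/\kappa$ term.

```latex
\[
\left.\frac{\sqrt{v_0 K}}{v_0 + K}\right|_{v_0 = 1}
  = \frac{\sqrt{K}}{1 + K} \le \frac{1}{\sqrt{K}}
  \quad\Longrightarrow\quad
  \tilde{\mathcal{O}}\!\left(d\sqrt{T/K}\right),
\]
\[
\left.\frac{\sqrt{v_0 K}}{v_0 + K}\right|_{v_0 = K}
  = \frac{K}{2K} = \frac{1}{2}
  \quad\Longrightarrow\quad
  \tilde{\mathcal{O}}\!\left(d\sqrt{T}\right).
\]
```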

Abstract

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
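To make the feedback model concrete, here is a minimal sketch of MNL choice probabilities for a single assortment. It assumes the standard linear-utility form, in which item $i$ has attraction $\exp(x_i^\top w)$ and the outside option has attraction $v_0$; the function names, the parameter `w`, and the feature layout are illustrative assumptions, not the paper's code or notation.

```python
import numpy as np


def mnl_choice_probs(features, w, v0=1.0):
    """Choice probabilities for one assortment under an assumed MNL model.

    features: (k, d) array, one row per offered item (hypothetical layout).
    w:        (d,) parameter vector (hypothetical name).
    v0:       attraction of the outside option (no item chosen).
    Returns (p_items, p_outside); together they sum to 1.
    """
    attractions = np.exp(features @ w)   # exp(x_i^T w) for each offered item
    denom = v0 + attractions.sum()       # outside option enters the normalizer
    return attractions / denom, v0 / denom


def expected_revenue(features, w, rewards, v0=1.0):
    """Expected reward of the assortment; the outside option pays nothing."""
    p_items, _ = mnl_choice_probs(features, w, v0)
    return float(p_items @ rewards)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 4, 5
    x = rng.normal(size=(k, d)) / np.sqrt(d)
    w = rng.normal(size=d) / np.sqrt(d)
    p_items, p_out = mnl_choice_probs(x, w, v0=1.0)
    print("item probabilities:", p_items)
    print("outside option probability:", p_out)
    # Uniform rewards: every item pays 1, so revenue equals P(some item is chosen).
    print("expected revenue:", expected_revenue(x, w, np.ones(k)))
```

Under uniform rewards the expected revenue in this sketch reduces to the probability that any item is chosen, which is why a larger $v_0$ relative to $K$ changes how much each round can reveal.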

Paper Structure (56 sections, 37 theorems, 211 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Problem Setting
Existing Gap between Upper and Lower Bounds in MNL Bandits
Algorithms and Main Results
Regret Lower Bound under Uniform Rewards
Minimax Optimal Regret Upper Bound under Uniform Rewards
Regret Upper & Lower Bounds under Non-Uniform Rewards
Instance-Dependent Bounds
Numerical Experiments
Conclusion
Appendix
Further Related Work
Notation
Properties of MNL function
...and 41 more sections

Key Result

Theorem 1

Let $d$ be divisible by $4$ and let the bounded assumption (assum:bounded_assumption) hold. Suppose $T \geq C \cdot d^4 (v_0 + K) / K$ for some constant $C>0$. Then, in the uniform reward setting, for any policy $\pi$, there exists a worst-case problem instance such that the worst-case expected regret of $\pi$ ...

Figures (3)

Figure 1: Cumulative regret (left three, $K=5,10,15$) and runtime per round (rightmost one, $K=15$) under uniform rewards (first row) and non-uniform rewards (second row) with $v_0 = 1$.
Figure K.1: Runtime per round under uniform rewards (first row) and non-uniform rewards (second row).
Figure K.2: Cumulative regret under uniform rewards with $v_0 = \Theta(K)$.

Theorems & Definitions (44)

Remark 1
Theorem 1: Regret lower bound, Uniform rewards
Lemma 1: Online parameter confidence set
Remark 2: Comparison to zhang2024online
Theorem 2: Regret upper bound of OFU-MNL+, Uniform rewards
Remark 3: Efficiency of OFU-MNL+
Theorem 3: Regret lower bound, Non-uniform rewards
Theorem 4: Regret upper bound of OFU-MNL+, Non-uniform rewards
Proposition 1: Instance-dependent regret lower bound, Uniform rewards
Proposition 2: Instance-dependent regret upper bound of OFU-MNL+, Uniform rewards
...and 34 more
