Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Joongkyu Lee, Min-hwan Oh
TL;DR
The paper studies contextual MNL bandits with assortments of size up to $K$ and shows that the minimax regret depends on the outside-option attraction $v_0$ and on the reward structure. It introduces OFU-MNL+, a constant-time algorithm based on online mirror descent and optimistic revenue. Under uniform rewards, OFU-MNL+ achieves near-optimal regret of $\tilde{\mathcal{O}}\left( \frac{\sqrt{v_0 K}}{v_0+K} d \sqrt{T} \right)$ (plus a $d^2/\kappa$ term), which reduces to $\tilde{\mathcal{O}}(d \sqrt{T/K})$ when $v_0=\Theta(1)$ and to $\tilde{\mathcal{O}}(d \sqrt{T})$ when $v_0=\Theta(K)$. Under non-uniform rewards, the paper proves a lower bound of $\Omega(d \sqrt{T})$ and shows that OFU-MNL+ attains a matching upper bound of $\tilde{\mathcal{O}}(d \sqrt{T})$. The work also provides instance-dependent bounds under uniform rewards and reports empirical improvements over existing MNL bandit algorithms. Overall, it delivers the first nearly minimax-optimal and computationally efficient contextual MNL bandit framework, with clear guidance on how $K$ and $v_0$ influence regret.
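As a quick check of the two special cases, the leading factor $\frac{\sqrt{v_0 K}}{v_0+K}$ can be simplified directly. This sketch sets $v_0$ to exactly $1$ and exactly $K$, an assumption made only to keep the arithmetic clean; the $\Theta(\cdot)$ statements follow the same way up to constants.

$$
\left.\frac{\sqrt{v_0 K}}{v_0+K}\right|_{v_0=1} = \frac{\sqrt{K}}{1+K} \le \frac{\sqrt{K}}{K} = \frac{1}{\sqrt{K}},
\qquad
\left.\frac{\sqrt{v_0 K}}{v_0+K}\right|_{v_0=K} = \frac{K}{2K} = \frac{1}{2}.
$$

Multiplying by $d\sqrt{T}$ gives $d\sqrt{T/K}$ in the first case and a constant multiple of $d\sqrt{T}$ in the second.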
Abstract
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, in either the uniform or the non-uniform reward setting, and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
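To make the feedback model concrete, the following is a minimal sketch of MNL choice probabilities for a single assortment. It assumes the common linear-utility form, in which item $i$ has attraction $\exp(x_i^\top \theta)$ and the outside option has attraction $v_0$. The function name `mnl_choice_probs` and its arguments are hypothetical and are not taken from the paper; this is not an implementation of OFU-MNL+.

```python
import math
from typing import List, Sequence, Tuple


def mnl_choice_probs(
    features: Sequence[Sequence[float]],
    theta: Sequence[float],
    v0: float = 1.0,
) -> Tuple[List[float], float]:
    """Return (item choice probabilities, outside-option probability).

    Assumed model (illustrative, not the paper's code):
        P(item i) = exp(x_i . theta) / (v0 + sum_j exp(x_j . theta))
        P(outside) = v0 / (v0 + sum_j exp(x_j . theta))
    """
    if v0 <= 0:
        raise ValueError("v0 must be positive")
    # Attraction of each offered item under a linear utility.
    weights = [
        math.exp(sum(x_k * t_k for x_k, t_k in zip(x, theta)))
        for x in features
    ]
    # The outside option contributes v0 to the normalizer.
    denom = v0 + sum(weights)
    return [w / denom for w in weights], v0 / denom


# Three items with zero features: every attraction equals 1, so with v0 = 1
# the three items and the outside option are equally likely.
probs, p_outside = mnl_choice_probs([[0.0, 0.0]] * 3, [0.5, -0.5], v0=1.0)
print(probs, p_outside)
```

Under this assumed form, a larger $v_0$ shifts probability mass toward the outside option, which is the quantity the regret bounds above depend on.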
