Enjoying Non-linearity in Multinomial Logistic Bandits
Pierre Boudart, Pierre Gaillard, Alessandro Rudi
TL;DR
We address the multinomial logistic bandit problem with $K$ outcomes, introducing a generalized nonlinearity constant $\kappa_*$ defined at the optimum to capture curvature. We propose an efficient OFU-based algorithm that leverages a two-phase explore-then-learn strategy and self-concordance of the softmax to achieve a regret of $\widetilde{O}(R d \sqrt{K T / \kappa_*})$, along with a matching lower bound $\Omega(R d \sqrt{K T / \kappa_*})$, establishing minimax-optimality and the optimality of $\kappa_*$. The analysis handles the nonlinearity through a carefully designed exploration stage that yields a tight confidence set and a tractable optimistic reward, achieving $O(1)$ per-round computation. These results generalize known binary logistic-bandit improvements to the multinomial setting and demonstrate that nonlinearity can be exploited to improve sequential decision-making in reinforcement learning and recommender-system contexts.
Abstract
We consider the multinomial logistic bandit problem, in which a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). These works introduced a problem-dependent constant $\kappa_* \geq 1$, which may be exponentially large in some problem parameters and is captured by the derivative of the sigmoid function. This constant encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $O(d\sqrt{T})$ to $O(d\sqrt{T/\kappa_*})$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of $\kappa_*$ to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $\widetilde{\mathcal{O}}(Rd\sqrt{KT/\kappa_*})$, where $R$ is the norm of the vector of rewards and $K$ is the number of outcomes. This improves upon the best existing guarantees of order $\widetilde{\mathcal{O}}(RdK\sqrt{T})$. Moreover, we provide an $\Omega(Rd\sqrt{KT/\kappa_*})$ lower bound, showing that our algorithm is minimax-optimal and that our definition of $\kappa_*$ is optimal.
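
As a rough illustration of why such a constant can be exponentially large in the binary setting, the sketch below assumes the form $\kappa_* = 1/\sigma'(z_*)$, where $\sigma$ is the sigmoid and $z_*$ is the logit of the optimal action. This formula is our reading of the binary-case definition in the cited work and is not stated in the abstract; the helper names `kappa_binary` and `bound_ratio` are hypothetical, and the multinomial definition is not reproduced here.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def kappa_binary(z_star: float) -> float:
    """Assumed binary-case constant: 1 / sigma'(z_star) (illustrative only)."""
    return 1.0 / sigmoid_derivative(z_star)


def bound_ratio(kappa: float) -> float:
    """Ratio of d*sqrt(T/kappa) to d*sqrt(T), i.e. 1/sqrt(kappa)."""
    return 1.0 / math.sqrt(kappa)


if __name__ == "__main__":
    # The constant grows roughly like exp(|z_star|) as the optimal logit
    # moves into the flat part of the sigmoid.
    for z_star in (0.0, 2.0, 5.0, 10.0):
        k = kappa_binary(z_star)
        print(f"z*={z_star:5.1f}  kappa={k:12.2f}  bound ratio={bound_ratio(k):.4f}")
```

Under this assumption, the further the optimal logit lies from zero, the flatter the sigmoid is there and the larger the constant, so the $\sqrt{T/\kappa_*}$ rate improves most on exactly the problems where the model is most non-linear.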
