On (Approximate) Pareto Optimality for the Multinomial Logistic Bandit
Jierui Zuo, Hanzhang Qin
TL;DR
This work tackles dynamic assortment optimization under the Multinomial Logit Bandit (MNL-Bandit) with unknown attraction parameters $\mathbf{v}$ by introducing Approximate Pareto Optimality as a joint objective over cumulative regret and parameter estimation accuracy. The authors design a UCB-based policy with forced exploration and complement sampling. For $N$ products, horizon $T$, and an exploration parameter $\alpha \in [0,1/2]$, the policy attains regret $O\left(\sqrt{N T \log(NT)} + N \log^2(NT) + N T^{1-\alpha}\right)$, which is sublinear in $T$ whenever $\alpha > 0$, and estimation error of order $O\left(\sqrt{T^{\alpha-1}}\right)$ for both $\mathbf{v}$ and the assortment revenues. They establish necessary and sufficient conditions for Approximate Pareto Optimality with respect to both revenue estimation and attraction-parameter estimation, and derive information-theoretic lower bounds showing that the policy lies on the Pareto frontier. The framework also extends to constraints on assortment size via Algorithm 2, which preserves the Approximate Pareto guarantees, and is illustrated with synthetic experiments on regret and inference performance. Overall, the paper gives a rigorous approach to balancing short-term revenue against long-term learning in combinatorial choice models, with practical implications for online recommender and retail systems.
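To make the role of $\alpha$ concrete, the following minimal sketch evaluates the two rates quoted above with all constants dropped. The function names and the values of $N$ and $T$ are illustrative assumptions, and $\log N T$ is read as $\log(NT)$; the numbers show only how the rates scale, not the actual regret or error of the authors' algorithm.

```python
import math


def regret_rate(n_products, horizon, alpha):
    """Regret rate from the summary, constants dropped (illustrative only)."""
    log_nt = math.log(n_products * horizon)
    ucb_term = math.sqrt(n_products * horizon * log_nt)
    log_term = n_products * log_nt ** 2
    # Cost of forced exploration: shrinks as alpha grows.
    exploration_term = n_products * horizon ** (1 - alpha)
    return ucb_term + log_term + exploration_term


def estimation_rate(horizon, alpha):
    """Estimation error rate sqrt(T^(alpha - 1)), constants dropped."""
    return math.sqrt(horizon ** (alpha - 1))


if __name__ == "__main__":
    n_products, horizon = 10, 100_000  # hypothetical problem size
    for alpha in (0.0, 0.25, 0.5):
        print(
            f"alpha={alpha:.2f}  "
            f"regret rate={regret_rate(n_products, horizon, alpha):.3e}  "
            f"estimation rate={estimation_rate(horizon, alpha):.3e}"
        )
```

A larger $\alpha$ lowers the regret rate and raises the estimation rate, which is the tradeoff the Pareto analysis formalizes.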
Abstract
We provide a new online learning algorithm for the Multinomial Logit Bandit (MNL-Bandit) problem. Despite the challenges posed by the combinatorial nature of the MNL model, we develop a novel Upper Confidence Bound (UCB)-based method that achieves Approximate Pareto Optimality by balancing regret minimization against the estimation error of the assortment revenues and the MNL parameters. We use information-theoretic bounds to establish theoretical guarantees that characterize the tradeoff between regret and estimation error in the MNL-Bandit problem, and we propose a modified UCB algorithm that incorporates forced exploration to improve parameter estimation accuracy while maintaining low regret. Our analysis offers insight into how to optimally balance collected revenue against estimation accuracy in dynamic assortment optimization.
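As background for the combinatorial structure mentioned above, here is a minimal sketch of the standard MNL choice model: a customer offered assortment $S$ picks item $i \in S$ with probability $v_i / (1 + \sum_{j \in S} v_j)$, with the no-purchase option normalized to weight 1. The per-item revenues `r`, the function names, and the brute-force search are illustrative assumptions; this is not the authors' UCB policy, which must learn $\mathbf{v}$ online.

```python
from itertools import combinations


def choice_probabilities(v, assortment):
    """MNL choice probabilities for the offered items; key None is no-purchase."""
    denom = 1.0 + sum(v[i] for i in assortment)
    probs = {i: v[i] / denom for i in assortment}
    probs[None] = 1.0 / denom
    return probs


def expected_revenue(v, r, assortment):
    """Expected revenue of an assortment under the MNL model."""
    denom = 1.0 + sum(v[i] for i in assortment)
    return sum(r[i] * v[i] for i in assortment) / denom


def best_assortment(v, r, max_size=None):
    """Brute-force search over non-empty assortments (exponential in N)."""
    n = len(v)
    max_size = n if max_size is None else max_size
    candidates = (
        subset
        for size in range(1, max_size + 1)
        for subset in combinations(range(n), size)
    )
    return max(candidates, key=lambda s: expected_revenue(v, r, s))


if __name__ == "__main__":
    v = [1.0, 0.5, 0.5]  # hypothetical attraction parameters
    r = [1.0, 2.0, 3.0]  # hypothetical per-item revenues
    best = best_assortment(v, r)
    print(best, expected_revenue(v, r, best))
    print(choice_probabilities(v, best))
```

The search space grows as $2^N$, which is why the bandit setting needs structured exploration rather than treating each assortment as an independent arm.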
