MNL-Bandit with Knapsacks: a near-optimal algorithm
Abdellah Aznag, Vineet Goyal, Noemie Perivier
TL;DR
This work analyzes dynamic assortment optimization under finite inventory with unknown customer preferences modeled by a multinomial logit (MNL) choice model. The authors introduce MNLwK-UCB, a UCB-based algorithm that operates in epochs and uses optimistic confidence bounds to solve a fluid relaxation with a distribution over feasible assortments, ensuring feasible inventory consumption. They derive regret bounds that scale with inventory through the term $r_{\text{inv}}$ and exhibit a near-optimal rate $\tilde{O}(\sqrt{NT})$ across regimes, including large inventories and sublinear growth $q_i = \Theta(T^{\alpha})$, $\alpha<1$. The analysis decomposes regret into estimation, randomness, and mis-specification components, and leverages concentration inequalities and a novel epoch-based bounding strategy to show the stopping time equals $T$ with high probability, yielding practical near-optimal performance without exponential action-space complexity.
Abstract
We consider a dynamic assortment selection problem where a seller has a fixed inventory of $N$ substitutable products and faces an unknown demand that arrives sequentially over $T$ periods. In each period, the seller needs to decide on the assortment of products (satisfying certain constraints) to offer to the customer. The customer's response follows an unknown multinomial logit (MNL) model with parameter $\boldsymbol{v}$. If a customer selects product $i \in [N]$, the seller receives revenue $r_i$. The goal of the seller is to maximize the total expected revenue from the $T$ customers given the fixed initial inventory of $N$ products. We present MNLwK-UCB, a UCB-based algorithm, and characterize its regret under different regimes of inventory size. We show that when the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a $\tilde{O}(N + \sqrt{NT})$ regret bound. We also show that for a smaller inventory (with growth $\sim T^{\alpha}$, $\alpha < 1$), MNLwK-UCB achieves a $\tilde{O}(N(1 + T^{\frac{1 - \alpha}{2}}) + \sqrt{NT})$ regret bound. In particular, over a long time horizon $T$, the rate $\tilde{O}(\sqrt{NT})$ is always achieved regardless of the constraints and the size of the inventory.
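To make the choice model concrete, the following minimal sketch computes MNL choice probabilities and the expected one-period revenue of an offered assortment. It assumes the normalization common in the MNL-bandit literature, in which the no-purchase option has weight 1, so that product $i$ in assortment $S$ is chosen with probability $v_i / (1 + \sum_{j \in S} v_j)$; the abstract does not state this normalization. The function names and the numeric values are hypothetical and chosen for illustration only. The sketch ignores inventory and is not the MNLwK-UCB algorithm.

```python
from typing import Dict, Iterable, Optional


def mnl_choice_probabilities(
    v: Dict[int, float], assortment: Iterable[int]
) -> Dict[Optional[int], float]:
    """Choice probabilities under an MNL model.

    Assumption: the no-purchase option has weight 1 (a common
    normalization, not stated in the abstract). The key None
    denotes "no purchase".
    """
    items = list(assortment)
    denom = 1.0 + sum(v[i] for i in items)
    probs: Dict[Optional[int], float] = {i: v[i] / denom for i in items}
    probs[None] = 1.0 / denom
    return probs


def expected_revenue(
    v: Dict[int, float], r: Dict[int, float], assortment: Iterable[int]
) -> float:
    """Expected revenue from one customer offered `assortment`."""
    items = list(assortment)
    probs = mnl_choice_probabilities(v, items)
    return sum(r[i] * probs[i] for i in items)


# Illustrative (made-up) attraction parameters and revenues.
v = {0: 1.0, 1: 0.5, 2: 0.5}
r = {0: 1.0, 1: 2.0, 2: 3.0}
revenue = expected_revenue(v, r, [0, 1])
```

In the bandit setting the parameters `v` are unknown, so an algorithm would replace them with estimates (for a UCB-style method, optimistic ones) before evaluating candidate assortments.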
