MNL-Bandit with Knapsacks: a near-optimal algorithm

Abdellah Aznag, Vineet Goyal, Noemie Perivier

TL;DR

This work analyzes dynamic assortment optimization under finite inventory with unknown customer preferences modeled by a multinomial logit (MNL) choice model. The authors introduce MNLwK-UCB, a UCB-based algorithm that operates in epochs and uses optimistic confidence bounds to solve a fluid relaxation with a distribution over feasible assortments, ensuring feasible inventory consumption. They derive regret bounds that scale with inventory through the term $r_{\text{inv}}$ and exhibit a near-optimal rate $\tilde{O}(\sqrt{NT})$ across regimes, including large inventories and sublinear growth $q_i = \Theta(T^{\alpha})$, $\alpha<1$. The analysis decomposes regret into estimation, randomness, and mis-specification components, and leverages concentration inequalities and a novel epoch-based bounding strategy to show the stopping time equals $T$ with high probability, yielding practical near-optimal performance without exponential action-space complexity.
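To make the choice model concrete, here is a minimal Python sketch of MNL choice probabilities and the expected one-period revenue of an assortment. It assumes the standard MNL parameterization in which product $i$ has attraction weight $v_i$ and the no-purchase option has weight 1. The function names, the `NO_PURCHASE` label, and the numeric values are illustrative and are not taken from the paper.

```python
import random

NO_PURCHASE = 0  # hypothetical label for the outside option; products are indexed from 1


def mnl_choice_probs(assortment, v):
    """Choice probabilities under MNL, assuming the no-purchase weight is normalized to 1."""
    denom = 1.0 + sum(v[i] for i in assortment)
    probs = {i: v[i] / denom for i in assortment}
    probs[NO_PURCHASE] = 1.0 / denom
    return probs


def expected_revenue(assortment, v, r):
    """Expected one-period revenue of offering `assortment`, ignoring inventory limits."""
    probs = mnl_choice_probs(assortment, v)
    return sum(r[i] * probs[i] for i in assortment)


def sample_choice(assortment, v, rng):
    """Draw one customer's choice (a product index or NO_PURCHASE)."""
    probs = mnl_choice_probs(assortment, v)
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


if __name__ == "__main__":
    v = {1: 1.0, 2: 0.5, 3: 0.5}     # illustrative attraction weights
    r = {1: 10.0, 2: 20.0, 3: 30.0}  # illustrative per-product revenues
    print(mnl_choice_probs([1, 2], v))
    print(expected_revenue([1, 2], v, r))
    print(sample_choice([1, 2], v, random.Random(0)))
```

In the bandit setting the weights `v` are unknown, so an algorithm such as MNLwK-UCB would replace them with optimistic estimates; the sketch only shows the model the estimates plug into.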

Abstract

We consider a dynamic assortment selection problem where a seller has a fixed inventory of $N$ substitutable products and faces an unknown demand that arrives sequentially over $T$ periods. In each period, the seller needs to decide on the assortment of products (satisfying certain constraints) to offer to the customers. The customer's response follows an unknown multinomial logit model (MNL) with parameter $\boldsymbol{v}$. If a customer selects product $i \in [N]$, the seller receives revenue $r_i$. The goal of the seller is to maximize the total expected revenue from the $T$ customers given the fixed initial inventory of $N$ products. We present MNLwK-UCB, a UCB-based algorithm, and characterize its regret under different regimes of inventory size. We show that when the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a $\tilde{O}(N + \sqrt{NT})$ regret bound. We also show that for a smaller inventory (with growth $\sim T^{\alpha}$, $\alpha < 1$), MNLwK-UCB achieves a $\tilde{O}(N(1 + T^{\frac{1 - \alpha}{2}}) + \sqrt{NT})$ regret bound. In particular, over a long time horizon $T$, the rate $\tilde{O}(\sqrt{NT})$ is always achieved regardless of the constraints and the size of the inventory.
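The final claim can be checked with a short calculation. The following is a sketch, assuming $N \geq 1$ is fixed, $0 < \alpha < 1$, and ignoring the logarithmic factors hidden by $\tilde{O}$. Comparing each extra term with $\sqrt{NT}$:

$$
\frac{N\,T^{\frac{1-\alpha}{2}}}{\sqrt{NT}} = \sqrt{N}\,T^{-\frac{\alpha}{2}} \leq 1 \iff T \geq N^{1/\alpha},
\qquad
\frac{N}{\sqrt{NT}} = \sqrt{\frac{N}{T}} \leq 1 \iff T \geq N .
$$

Since $N^{1/\alpha} \geq N$ when $\alpha < 1$, both terms are at most $\sqrt{NT}$ once $T \geq N^{1/\alpha}$, so the bound $\tilde{O}(N(1 + T^{\frac{1-\alpha}{2}}) + \sqrt{NT})$ reduces to $\tilde{O}(\sqrt{NT})$ in that range. The same comparison of $N$ with $\sqrt{NT}$ handles the quasi-linear regime.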

Paper Structure

This paper contains 18 sections, 19 theorems, 136 equations, 1 table, 1 algorithm.

Key Result

Lemma 1

For any non-anticipative policy $\mathcal{P}$, $\mathbb{E}_{\mathcal{P}}\sum_{t = 1}^T r_{c_t} \leq T r_{\sf opt}$.
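Lemma 1 says that $T r_{\sf opt}$ upper-bounds the expected revenue of every non-anticipative policy, which makes it a natural benchmark for regret. This summary does not reproduce the paper's definition of $r_{\sf opt}$. The sketch below assumes it is the per-period value of a standard fluid relaxation over distributions on feasible assortments. The symbols $\mathcal{S}$ (feasible assortments), $p$ (a distribution over $\mathcal{S}$), $q_i$ (initial inventory of product $i$), $R$, and $\mathrm{Regret}$ are notation for this sketch and may differ from the paper's.

$$
r_{\sf opt} = \max_{p \in \Delta(\mathcal{S})} \sum_{S \in \mathcal{S}} p(S)\, R(S, \boldsymbol{v})
\quad \text{s.t.} \quad
\sum_{S \ni i} p(S)\, \frac{v_i}{1 + \sum_{j \in S} v_j} \leq \frac{q_i}{T} \quad \forall i \in [N],
\qquad
R(S, \boldsymbol{v}) = \sum_{i \in S} \frac{r_i v_i}{1 + \sum_{j \in S} v_j}.
$$

Under this reading, the regret of a policy $\mathcal{P}$ against the benchmark is

$$
\mathrm{Regret}(\mathcal{P}) = T r_{\sf opt} - \mathbb{E}_{\mathcal{P}} \sum_{t=1}^{T} r_{c_t} \geq 0,
$$

where nonnegativity is exactly the statement of Lemma 1.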

Theorems & Definitions (30)

  • Lemma 1
  • Theorem 1
  • Lemma 2: DBLP:journals/corr/AgrawalAGZ17a
  • Lemma 3: Confidence bounds for $\boldsymbol{v}$
  • Lemma 4
  • Proposition 1
  • Lemma 5: Bounding $\delta_{i, T}^{\sf random}$
  • Lemma 6: Bounding $\delta_{i, T}^{\sf mnl}$
  • Lemma 7
  • Lemma 8
  • ...and 20 more