A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

Priyank Agrawal, Theja Tulabandhula, Vashist Avadhanula

TL;DR

The paper tackles online learning for contextual multinomial logit bandits in dynamic assortment optimization, where item utilities are linear in attributes. It introduces CB-MNL, an optimistic, curvature-aware algorithm that uses Bernstein-style concentration and a convex relaxation to enable tractable decisions. The main result is a regret bound of $\tilde{O}(d\sqrt{T} + \kappa)$, substantially reducing the dependence on the problem-dependent parameter $\kappa$ seen in prior work. The analysis leverages self-concordance properties of the MNL link to bound estimation and prediction errors, and demonstrates that the convex relaxation preserves the regret guarantees. Empirical results show robust performance across different $\kappa$ regimes and problem scales, underscoring practical impact for revenue management and contextual recommendation scenarios.

Abstract

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can have an exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
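The choice model described in the abstract can be illustrated with a minimal sketch. It assumes the standard MNL form with an outside (no-purchase) option of utility 0; the function names, the `revenues` vector, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def mnl_choice_probabilities(features, theta):
    """Return (p_items, p_no_purchase) for one offered assortment.

    features: (K, d) array, one attribute vector per offered product.
    theta:    (d,) parameter vector; the mean utility of item i is features[i] @ theta.
    Assumption: the no-purchase option has utility 0.
    """
    utilities = features @ theta
    # Subtract a common shift for numerical stability; the ratios are unchanged.
    shift = max(float(utilities.max()), 0.0)
    weights = np.exp(utilities - shift)
    outside = np.exp(-shift)
    denom = outside + weights.sum()
    return weights / denom, outside / denom

def expected_revenue(features, theta, revenues):
    """Expected revenue of the assortment: sum of revenue times purchase probability."""
    p_items, _ = mnl_choice_probabilities(features, theta)
    return float(revenues @ p_items)

# Toy usage: two products, two attributes, zero parameter vector.
features = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = np.zeros(2)
p_items, p_none = mnl_choice_probabilities(features, theta)
revenue = expected_revenue(features, theta, np.array([3.0, 6.0]))
```

With all utilities equal to zero, each product and the no-purchase option are chosen with equal probability. A learner that does not know `theta` has to estimate it from observed choices while picking assortments with high expected revenue.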

Paper Structure

This paper contains 30 sections, 18 theorems, 108 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1

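With probability at least $1-\delta$ over the randomness of user choices, the cumulative regret of CB-MNL is bounded at the order $\tilde{O}(d\sqrt{T} + \kappa)$ stated in the TL;DR. In the bound, the constants are $C_1 = 4+8S$ and $C_2 = 4(4+8S)^{3/2}M$, and $\gamma_T(\delta)$ is defined in the paper's main text (equation label `eq: gamma value main`).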

Figures (3)

  • Figure 1: Illustration of the impact of the $\kappa$ parameter (logistic case, multinomial logit case closely follows): A representative plot of the derivative of the reward function. The x-axis represents the linear function $x^\top\theta$ and the y-axis is proportional to $1/\kappa$. Parameter $\kappa$ is small only in the narrow region around $0$ and grows arbitrarily large depending on the problem instance (captured by $x^\top\theta$ values).
  • Figure 2: Comparison of cumulative regret as a function of time for varying $\kappa$ (left to right: $\kappa \gg \sqrt{T}$, $\kappa < \sqrt{T}$, and $\kappa \ll \sqrt{T}$)
  • Figure 3: Comparison of cumulative regret for two additional parameter instances (left: $\kappa \gg \sqrt{T}$, $N=10,d=3,K=6,T=100$; right: $\kappa \gg \sqrt{T}$, $N=20,d=3,K=5,T=100$)
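
The behaviour described in the Figure 1 caption can be reproduced numerically for the logistic case. The sketch below assumes that $\kappa$ is the reciprocal of the smallest derivative of the logistic link over the values $x^\top\theta$ that occur. This follows the caption's statement that the derivative is proportional to $1/\kappa$; it is not the paper's exact MNL definition, which is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic link: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def kappa_logistic(z_values):
    """Assumed logistic-case kappa: 1 / min derivative over the given z = x^T theta values."""
    z = np.asarray(z_values, dtype=float)
    return float(1.0 / np.min(sigmoid_derivative(z)))

# kappa is smallest when all z values sit at 0 and grows quickly as |z| increases.
kappa_center = kappa_logistic([0.0])
kappa_wide = kappa_logistic([-5.0, 0.0, 5.0])
kappa_wider = kappa_logistic([-10.0, 0.0, 10.0])
```

Because the derivative decays roughly like $e^{-|z|}$ away from $0$, this quantity grows roughly exponentially in the largest $|x^\top\theta|$, which is why a regret bound that multiplies $\sqrt{T}$ by a function of $\kappa$ can be much weaker than one where $\kappa$ enters only additively.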

Theorems & Definitions (36)

  • Remark 1: Optimistic parameter search
  • Remark 2: Tractable decision-making
  • Theorem 1
  • Corollary 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • ...and 26 more