A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit
Priyank Agrawal, Theja Tulabandhula, Vashist Avadhanula
TL;DR
The paper tackles online learning for contextual multinomial logit bandits in dynamic assortment optimization, where item utilities are linear in attributes. It introduces CB-MNL, an optimistic, curvature-aware algorithm that uses Bernstein-style concentration and a convex relaxation to enable tractable decisions. The main result is a regret bound of $\tilde{O}(d\sqrt{T} + \kappa)$, substantially reducing the dependence on the problem-dependent parameter $\kappa$ seen in prior work. The analysis leverages self-concordance properties of the MNL link to bound estimation and prediction errors, and demonstrates that the convex relaxation preserves the regret guarantees. Empirical results show robust performance across different $\kappa$ regimes and problem scales, underscoring practical impact for revenue management and contextual recommendation scenarios.
Abstract
In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can have an exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
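
To make the choice model in the abstract concrete, the following is a minimal sketch of MNL purchase probabilities when the mean utility of a product is linear in its attributes. It is an illustration only, not the paper's algorithm: the function names, the per-product `revenues` vector, and the no-purchase option with utility normalized to zero are assumptions that the abstract does not state.

```python
import numpy as np


def mnl_choice_probabilities(features, theta):
    """Purchase probabilities under an MNL model with linear utilities.

    features: (K, d) array, one attribute vector per offered product.
    theta:    (d,) parameter vector; product i has mean utility features[i] @ theta.

    Assumption (not stated in the abstract): the consumer may also buy nothing,
    and that outside option has utility 0.
    Returns (probs, p_none): probs has shape (K,), p_none is a float.
    """
    utilities = features @ theta
    # Subtract a common shift before exponentiating for numerical stability;
    # the shift cancels in the ratio below.
    shift = max(float(utilities.max()), 0.0)
    weights = np.exp(utilities - shift)
    outside = np.exp(-shift)
    denom = outside + weights.sum()
    return weights / denom, float(outside / denom)


def expected_revenue(features, theta, revenues):
    """Expected revenue of one assortment; `revenues` is a hypothetical (K,) vector."""
    probs, _ = mnl_choice_probabilities(features, theta)
    return float(probs @ revenues)
```

With `theta` set to zero, every product and the no-purchase option are equally likely, which gives a quick sanity check on the normalization. In the online problem the decision-maker does not know `theta` and must estimate it from the observed responses while choosing assortments.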
