DISCO: An End-to-End Bandit Framework for Personalised Discount Allocation
Jason Shuo Zhang, Benjamin Howson, Panayiota Savva, Eleanor Loh
TL;DR
DISCO addresses the challenge of allocating personalised discount codes under partial information by integrating low-dimensional action representations with neural context embeddings in a Bayesian contextual bandit framework. Actions are encoded using radial basis functions and learned within a Thompson Sampling scheme, with a constrained integer program enforcing operational budget and distribution requirements. The reward model preserves negative price elasticity and enables extrapolation to unseen actions. Offline analyses and a large online A/B test show improvements over legacy and undifferentiated campaigns. The approach is data-efficient, scalable, and broadly applicable to other personalisation problems where global constraints shape action choices.
Abstract
Personalised discount codes provide a powerful mechanism for managing customer relationships and operational spend in e-commerce. Bandits are well suited to this product area, given the partial-information nature of the problem and the need to adapt to a changing business environment. Here, we introduce DISCO, an end-to-end contextual bandit framework for personalised discount code allocation at ASOS. DISCO adapts the traditional Thompson Sampling algorithm by integrating it within an integer program, thereby allowing for operational cost control. Because bandit learning often degrades with high-dimensional actions, we focused on building low-dimensional action and context representations that nonetheless achieve good accuracy. Additionally, we sought to build a model that preserves the relationship between price and sales, in which customers increase their purchasing in response to lower prices ("negative price elasticity"). These aims were achieved by using radial basis functions to represent the continuous (i.e. infinite-armed) action space, in combination with context embeddings extracted from a neural network. These feature representations were used within a Thompson Sampling framework to facilitate exploration, and further integrated with an integer program to allocate discount codes across ASOS's customer base. These modelling decisions result in a reward model that (a) enables pooled learning across similar actions, (b) is highly accurate, including in extrapolation, and (c) preserves the expected negative price elasticity. Through offline analysis, we show that DISCO explores effectively and improves its performance over time, despite the global constraint. Finally, we subjected DISCO to a rigorous online A/B test and found that it achieves a significant improvement of >1% in average basket value, relative to the legacy systems.
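
To make the action representation concrete, the following is a minimal sketch of radial basis function features over a continuous discount level, combined with one Thompson Sampling draw from a conjugate Bayesian linear reward model. It is illustrative only and is not the DISCO implementation: the function names, the Gaussian kernel, the choice of centres and width, the prior and noise variances, and the toy data are all assumptions, and both the neural context embeddings and the integer program are omitted.

```python
import numpy as np


def rbf_features(discount, centres, width):
    """Encode discount level(s) as Gaussian RBF activations (hypothetical helper)."""
    d = np.atleast_1d(np.asarray(discount, dtype=float))[:, None]
    return np.exp(-((d - centres[None, :]) ** 2) / (2.0 * width ** 2))


def posterior(Phi, y, prior_var=1.0, noise_var=1.0):
    """Gaussian posterior over weights for Bayesian linear regression."""
    k = Phi.shape[1]
    precision = np.eye(k) / prior_var + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # remove numerical asymmetry
    mean = cov @ (Phi.T @ y) / noise_var
    return mean, cov


def thompson_scores(mean, cov, candidates, centres, width, rng):
    """Sample one weight vector and score every candidate discount with it."""
    w = rng.multivariate_normal(mean, cov)
    return rbf_features(candidates, centres, width) @ w


# Toy data (assumed): observed discounts and the rewards that followed.
rng = np.random.default_rng(0)
centres = np.linspace(0.0, 0.5, 6)  # assumed RBF centres on the discount axis
width = 0.1                         # assumed shared kernel width
seen_discounts = np.array([0.05, 0.10, 0.20, 0.30, 0.40])
seen_rewards = np.array([1.0, 1.2, 1.5, 1.7, 1.8])

mean, cov = posterior(rbf_features(seen_discounts, centres, width), seen_rewards)

# Candidates include levels never observed; nearby centres share information.
candidates = np.array([0.00, 0.15, 0.25, 0.35, 0.50])
scores = thompson_scores(mean, cov, candidates, centres, width, rng)
chosen = candidates[np.argmax(scores)]
```

Because neighbouring discount levels activate overlapping basis functions, an observation at one level also updates the predicted reward at nearby levels, which is the pooled learning across similar actions described above. In the full framework the sampled scores would be passed to the integer program rather than maximised independently for each customer.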
