DISCO: An End-to-End Bandit Framework for Personalised Discount Allocation

Jason Shuo Zhang; Benjamin Howson; Panayiota Savva; Eleanor Loh

TL;DR

DISCO addresses the challenge of allocating personalised discount codes under partial information by integrating low-dimensional action representations with neural context embeddings in a Bayesian contextual bandit framework. Actions are encoded using radial basis functions and learned within a Thompson Sampling scheme, with a constrained integer program enforcing operational budget and distribution requirements. The reward model preserves negative price elasticity and enables extrapolation to unseen actions. Offline analyses and a large online A/B test show improvements over legacy and undifferentiated campaigns. The approach is data-efficient, scalable, and broadly applicable to other personalisation problems where global constraints shape action choices.

Abstract

Personalised discount codes provide a powerful mechanism for managing customer relationships and operational spend in e-commerce. Bandits are well suited to this product area, given the partial-information nature of the problem and the need to adapt to a changing business environment. Here, we introduce DISCO, an end-to-end contextual bandit framework for personalised discount code allocation at ASOS. DISCO adapts the traditional Thompson Sampling algorithm by integrating it within an integer program, thereby allowing for operational cost control. Because bandit learning often degrades with high-dimensional actions, we focused on building low-dimensional action and context representations that were nonetheless capable of good accuracy. Additionally, we sought to build a model that preserved the relationship between price and sales, in which customers increase their purchasing in response to lower prices ("negative price elasticity"). These aims were achieved by using radial basis functions to represent the continuous (i.e. infinite-armed) action space, in combination with context embeddings extracted from a neural network. These feature representations were used within a Thompson Sampling framework to facilitate exploration, and further integrated with an integer program to allocate discount codes across ASOS's customer base. These modelling decisions result in a reward model that (a) enables pooled learning across similar actions, (b) is highly accurate, including in extrapolation, and (c) preserves the expected negative price elasticity. Through offline analysis, we show that DISCO explores effectively and improves its performance over time, despite the global constraint. Finally, we subjected DISCO to a rigorous online A/B test and found that it achieves a significant improvement of >1% in average basket value relative to the legacy systems.
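To make the "Thompson Sampling within an integer program" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes independent Gaussian posteriors per (customer, discount) pair instead of the Bayesian log-linear model, a hypothetical budget expressed as a cap on total discount percentage points, and exhaustive enumeration in place of a real integer programming solver. The names `sample_rewards`, `allocate`, and all numbers are illustrative only.

```python
import itertools
import random


def sample_rewards(post_mean, post_sd, rng):
    """Draw one Thompson sample per (customer, action).

    Assumption: independent Gaussian posteriors, used here only as a
    stand-in for a posterior over reward model parameters.
    """
    return [
        [rng.gauss(mean, sd) for mean, sd in zip(means, sds)]
        for means, sds in zip(post_mean, post_sd)
    ]


def allocate(sampled, depths_pct, budget_pct):
    """Pick one discount per customer to maximise total sampled reward,
    subject to sum of discount depths <= budget_pct.

    Brute-force enumeration replaces an integer programming solver, so
    this only works for a handful of customers and actions.
    """
    best_value, best_plan = float("-inf"), None
    for plan in itertools.product(range(len(depths_pct)), repeat=len(sampled)):
        cost = sum(depths_pct[a] for a in plan)
        if cost > budget_pct:
            continue  # violates the (hypothetical) global budget
        value = sum(sampled[i][a] for i, a in enumerate(plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


if __name__ == "__main__":
    depths_pct = [0, 10, 20]  # hypothetical discount depths, in percent
    post_mean = [[50, 55, 58], [40, 48, 49], [30, 31, 40]]  # made-up values
    post_sd = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    rng = random.Random(0)
    sampled = sample_rewards(post_mean, post_sd, rng)
    plan, value = allocate(sampled, depths_pct, budget_pct=30)
    print("allocation (action index per customer):", plan)
```

Because the allocation is solved on sampled rather than mean rewards, customers with uncertain estimates occasionally receive discounts that look suboptimal on average, which is how exploration happens while the global constraint is still respected in every round.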

Paper Structure (15 sections, 8 equations, 4 figures)

Introduction
Problem formulation
DISCO architecture
Action feature representation
Context feature representation
Reward prediction: Bayesian log-linear regression
Reward sampling
Optimisation of discount code allocation
Experiments
Information sharing and price elasticity with RBF encoding
Reward prediction model
Active learning with global constraints
Online A/B Test
Concluding Discussion
Acknowledgments

Figures (4)

Figure 1: Overview of DISCO. DISCO uses low dimensional context embeddings (from a neural network) alongside radial basis functions that represent a continuous action space with low cardinality. These action representations enable pooled learning across similar actions. Features are used within a Bayesian log-linear regression to predict basket-level revenue (the reward signal). Constrained integer programming is then used to allocate discounts with operational control.
Figure 2: Action encoding mechanism. The left figure illustrates a 3-dimensional encoding of each discount depth from 0.0 to 1.0 using the RBF transformation with three basis locations (0.25, 0.5, 0.75). This encoding mechanism leads to information sharing, as measured by the effective number of times the algorithm has selected each action for a fixed context, depicted in the middle figure. The right figure shows that the uncertainty (standard deviation; SD) in the reward model adapts to increasing exposure to different regions of the action space, including regions that are unrepresented in the training data (extrapolation/interpolation; shaded in pink). Each line shows the uncertainty over 1K randomly selected customers, where the model is trained on different volumes of data. As the volume of data increases, the model retains greater uncertainty for the previously unseen extrapolation range $a < 0.6$. Meanwhile, its confidence in that range still increases incrementally due to the RBF's information sharing.
Figure 3: Negative price elasticity. The left figure shows the observed relationship between discounting and full-price basket values, which is in line with the conventional assumption of price elasticity. Monotonicity is expected and observed only when looking at full-price basket values, not discounted ones. The middle figure demonstrates different action encoding mechanisms and their effects. An RBF encoding scheme with $K=3$ centroids and $\alpha=20$ demonstrates the desired near-monotonic relationship between the actions and their corresponding effects. In the right figure, the chosen action encoding scheme ($K=3$, $\alpha=20$), as used in the overall Bayesian log-linear reward model, produces the expected monotonicity both overall (blue; CIs indicate 95% CI of the mean) and for 3 randomly selected customers.
Figure 4: Evaluation of bandit algorithms. Performance of different constrained agents under warm-start (left) and cold-start (middle) scenarios. TS-IP demonstrates the strongest long-term performance, while UCB-IP's long-term performance is notably hampered. The right figure compares TS-IP to a TS-ULCC benchmark ("Unconstrained Learner, Constrained Consumer"; warm start). In this benchmark, "exploitative" actions are IP-constrained, but separate "explorative" actions are taken to update the model without consuming rewards. The reported consumed rewards come from IP-constrained actions, using a predictive model that is improved over time by the unconstrained-action updates. Benchmarking against TS-ULCC quantifies how much TS-IP's long-term performance is affected by the inability to choose actions across the full action space (due to the IP constraint), while still respecting the practical action constraints on the rewards harvested in each round. Although TS-IP's long-term performance is slightly degraded compared to the idealised TS-ULCC benchmark, the degradation is minimal (0.234%) and does not significantly escalate over 100 rounds of learning. This indicates that the IP constraint does not have an unacceptably harmful effect on DISCO's active learning capabilities.
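The action encoding described in the Figure 2 and Figure 3 captions can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it assumes a Gaussian RBF of the form exp(-alpha * (a - c)^2), with the three basis locations (0.25, 0.5, 0.75) and $\alpha=20$ taken from the captions. The cosine-similarity helper is a hypothetical addition used only to show that nearby discount depths receive similar feature vectors, which is what permits pooled learning across similar actions.

```python
import math

CENTRES = (0.25, 0.5, 0.75)  # basis locations from the Figure 2 caption
ALPHA = 20.0                 # width parameter from the Figure 3 caption


def rbf_encode(depth, centres=CENTRES, alpha=ALPHA):
    """Map a discount depth in [0, 1] to a K-dimensional feature vector.

    Assumption: Gaussian kernel exp(-alpha * (depth - centre)^2).
    """
    return [math.exp(-alpha * (depth - c) ** 2) for c in centres]


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    for depth in (0.30, 0.32, 0.70):
        print(depth, [round(x, 3) for x in rbf_encode(depth)])
    near = cosine_similarity(rbf_encode(0.30), rbf_encode(0.32))
    far = cosine_similarity(rbf_encode(0.30), rbf_encode(0.70))
    print("near depths more similar than far depths:", near > far)
```

With only three features, an observation at one discount depth also informs the model about neighbouring depths, while depths far from the observed data share little weight and therefore keep higher uncertainty.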
