Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype
Nikola Tankovic, Robert Sajina
TL;DR
Contextual Multi-Armed Bandits (CMABs) enable context-aware sequential decision-making but face trade-offs between scalability and interpretability in dynamic retail settings. The paper presents a scalable, interpretable offer-selection prototype that operates on category-level contexts using Member Purchase Gap (MPG) features, Matrix Factorization (MF) signals, and logistic regression trained with stochastic gradient descent (SGD), with Beta-based exploration and explicit weight trajectories exposed to large language models (LLMs) for explanations. It surveys key CMAB families (UCB, Epsilon-Greedy, Posterior Sampling) and Generalized Linear Models (GLMs), situating the prototype among established paradigms like LinUCB and Thompson Sampling, and demonstrates a practical reference implementation that emphasizes interpretability without resorting to full neural-bandit complexity. The work contributes a controllable baseline for understanding bandit behavior at scale, offers a path to transparent, data-efficient personalized offer optimization, and outlines concrete steps toward production-ready deployment with richer representations and non-stationarity handling.
Abstract
This paper presents a concise review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection when offers change quickly. The approach models context at the product category level, so that an offer can span multiple categories and knowledge can transfer across similar offers, which improves learning efficiency and generalization in dynamic environments. The framework extends standard CMAB methodology to support multi-category contexts and achieves scalability through efficient feature engineering and modular design. Features such as Member Purchase Gap (MPG) and Matrix Factorization (MF) signals capture nuanced user-offer interactions, and the framework is implemented in Python for practical deployment. A key contribution is interpretability at scale: logistic regression models yield transparent weight vectors, which are accessible through a large language model (LLM) interface for real-time, user-level tracking and explanation of evolving preferences. This enables the generation of detailed member profiles and the identification of behavioral patterns, supporting personalized offer optimization and enhancing trust in automated decisions. By situating our prototype alongside established paradigms like Generalized Linear Models and Thompson Sampling, we demonstrate its value for both research and real-world CMAB applications.
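To make the ingredients named above concrete, the following minimal sketch shows one way a logistic regression scorer trained by SGD could be combined with Beta-based exploration and a recorded weight trajectory. It is an illustration under stated assumptions, not the prototype's actual code: the class `LogisticOfferModel`, the function `select_offer`, the three-element feature layout (bias, MPG, MF score), and in particular the way the Beta distribution is parameterized from the predicted probability are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class LogisticOfferModel:
    """Logistic regression over category-level features, trained by SGD.

    Hypothetical sketch: features are assumed to be [bias, MPG, MF score].
    """

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # interpretable weight vector
        self.lr = lr
        self.n_obs = 0               # number of feedback events seen
        self.history = []            # weight trajectory, one snapshot per update

    def predict(self, x):
        # Predicted acceptance probability for feature vector x.
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def explore_score(self, x, rng):
        # Assumption: draw from a Beta distribution centred near the predicted
        # probability, whose concentration grows with observed feedback, so
        # early scores are noisy (exploration) and later scores are sharper.
        p = self.predict(x)
        n = self.n_obs
        return rng.betavariate(1.0 + p * n, 1.0 + (1.0 - p) * n)

    def update(self, x, reward):
        # One SGD step on the log-loss; the gradient is (p - reward) * x.
        err = self.predict(x) - reward
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.n_obs += 1
        self.history.append(list(self.w))


def select_offer(model, offers, rng):
    # offers maps an offer id to its feature vector; pick the highest sampled score.
    return max(offers, key=lambda o: model.explore_score(offers[o], rng))


# Usage with made-up feature values: [bias, MPG, MF score].
rng = random.Random(0)
model = LogisticOfferModel(n_features=3)
offers = {"coffee": [1.0, 0.2, 0.9], "detergent": [1.0, 0.8, 0.1]}
chosen = select_offer(model, offers, rng)
model.update(offers[chosen], reward=1)  # reward=1 means the offer was accepted
```

The `history` list is what an explanation layer could read: each snapshot is a plain weight vector, so a change in the weight on the MPG or MF feature over time can be described directly in terms of that feature.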
