Thompson Sampling for Repeated Newsvendor

Li Chen, Hanzhang Qin, Yunbei Xu, Ruihao Zhu, Weizhou Zhang

TL;DR

The paper tackles online learning for the repeated newsvendor with censored feedback, addressing how to balance exploration and exploitation when demand is only partially observed. It proposes Thompson Sampling (TS) with a Gamma prior on Weibull-demand parameters, deriving frequentist regret bounds that scale as $\tilde{O}(\sqrt{T})$ and providing insights into how censoring shapes learning. Extending beyond Weibull, the authors develop a Bayesian regret analysis for general parametric families using Kaplan–Meier estimators and plug-in selection, accompanied by Lipschitz-based regret bounds. Empirically, TS outperforms online convex optimization, upper confidence bounds, and myopic dynamic programming across various service levels, highlighting TS’s practical value for inventory decisions under censored information.

Abstract

In this paper, we investigate the performance of Thompson Sampling (TS) for online learning with censored feedback, focusing primarily on the classic repeated newsvendor model, a foundational framework in inventory management, and demonstrating how our techniques can be naturally extended to a broader class of problems. We first model demand using a Weibull distribution and initialize TS with a Gamma prior to dynamically adjust order quantities. Our analysis establishes optimal (up to logarithmic factors) frequentist regret bounds for TS without imposing restrictive prior assumptions. More importantly, it yields novel and highly interpretable insights on how TS addresses the exploration-exploitation trade-off in the repeated newsvendor setting. Specifically, our results show that when past order quantities are sufficiently large to overcome censoring, TS accurately estimates the unknown demand parameters, leading to near-optimal ordering decisions. Conversely, when past orders are relatively small, TS automatically increases future order quantities to gather additional demand information. We then extend our analysis to general parametric distribution families and establish Bayesian regret bounds. Extensive numerical simulations further demonstrate that TS outperforms more conservative and widely used approaches such as online convex optimization, upper confidence bounds, and myopic Bayesian dynamic programming.
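The loop described in the abstract (sample a demand parameter from the posterior, order the matching critical quantile, observe censored sales, update) can be illustrated with a minimal sketch. It rests on assumptions not stated on this page: the Weibull shape `k` is known, demand has survival function exp(-theta * x^k), the Gamma prior is on `theta` in shape/rate form, and `q` is the target service level. The function name `ts_newsvendor` and all default values are hypothetical, and the paper's own algorithm may differ in its details.

```python
import numpy as np

def ts_newsvendor(T, theta_true, k=2.0, q=0.9, alpha0=1.0, beta0=1.0, seed=0):
    """Thompson Sampling sketch for a repeated newsvendor with censored sales.

    Assumed model (illustrative, not taken from the paper's text):
    P(D > x) = exp(-theta * x**k) with known shape k, and a
    Gamma(alpha, rate=beta) prior on theta, which is conjugate here.
    """
    rng = np.random.default_rng(seed)
    alpha, beta = alpha0, beta0
    log_term = -np.log(1.0 - q)  # solves 1 - exp(-theta * y**k) = q for y
    orders = np.empty(T)

    for t in range(T):
        # 1. Sample a plausible parameter from the current posterior.
        theta = rng.gamma(shape=alpha, scale=1.0 / beta)
        # 2. Order the q-quantile of demand under the sampled parameter.
        y = (log_term / theta) ** (1.0 / k)
        # 3. Draw true demand by inverse transform; only sales are observed.
        demand = (rng.exponential(1.0) / theta_true) ** (1.0 / k)
        sales = min(demand, y)
        # 4. Conjugate update: an uncensored period adds one to the shape,
        #    and every period adds sales**k to the rate.
        if demand < y:
            alpha += 1.0
        beta += sales ** k
        orders[t] = y

    # Clairvoyant order quantity under the true parameter, for comparison.
    y_star = (log_term / theta_true) ** (1.0 / k)
    return orders, y_star, (alpha, beta)
```

Under these assumptions, a censored period (a stockout) raises only the rate parameter, which pulls the posterior on `theta` toward smaller values and therefore toward larger future orders. This matches the qualitative behavior the abstract attributes to TS when past orders are small.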

Paper Structure

This paper contains 42 sections, 14 theorems, 65 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

$T$-period regret of a given $\theta_{\star}$ for repeated newsvendor problem is

Figures (4)

  • Figure 1: Compare TS with OCO and UCB
  • Figure 2: Compare TS with Myopic Policy
  • Figure 3: (Normal Distribution) Compare TS with OCO and UCB
  • Figure 4: (Normal Distribution) Compare TS with Myopic

Theorems & Definitions (26)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6: chuang2023bayesian
  • Definition 3
  • ...and 16 more