Transfer Learning for Nonparametric Contextual Dynamic Pricing
Fan Wang, Feiyu Jiang, Zifeng Zhao, Yi Yu
TL;DR
This work tackles nonparametric contextual dynamic pricing when target data are limited by leveraging related pre-collected data from a source domain under covariate shift. It introduces the TLDP algorithm, which adaptively partitions the joint covariate-price space and uses source data to bootstrap exploration, achieving regret $R \lesssim n_Q\left(n_Q+(\kappa n_P)^{(d+3)/(d+3+\gamma)}\right)^{-1/(d+3)}$ up to logarithmic factors. The authors prove a matching minimax lower bound, establishing the minimax optimality of TLDP and quantifying how transfer quality, through the transfer exponent $\gamma$ and exploration coefficient $\kappa$, influences performance. They validate the approach with synthetic and real data, showing that TLDP reduces target regret in practice and remains robust across transfer settings. Overall, the paper provides a rigorous foundation for transfer learning in a continuous-action, nonparametric pricing context and offers actionable guidance for deploying cross-domain pricing strategies.
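To get a feel for how the bound above behaves, the following minimal sketch evaluates its rate numerically, ignoring constants and logarithmic factors. It assumes that $n_Q$ is the number of target rounds, $n_P$ the number of source samples, and $d$ the covariate dimension; the summary does not define these symbols, so that reading is an assumption. The function name `tldp_regret_rate` and the example values are hypothetical and not taken from the paper.

```python
def tldp_regret_rate(n_q, n_p, d, gamma, kappa):
    """Rate of the stated bound, without constants or log factors:
    n_Q * (n_Q + (kappa * n_P)^((d+3)/(d+3+gamma)))^(-1/(d+3)).

    Assumed meanings (not defined in the summary): n_q = target rounds,
    n_p = source samples, d = covariate dimension. Requires n_q > 0.
    """
    # Source data enter through a discounted "effective" sample size.
    effective_source = (kappa * n_p) ** ((d + 3) / (d + 3 + gamma))
    return n_q * (n_q + effective_source) ** (-1.0 / (d + 3))


# Hypothetical settings, chosen only for illustration.
target_only = tldp_regret_rate(n_q=1000, n_p=0, d=2, gamma=1.0, kappa=1.0)
with_source = tldp_regret_rate(n_q=1000, n_p=100_000, d=2, gamma=1.0, kappa=1.0)
weak_transfer = tldp_regret_rate(n_q=1000, n_p=100_000, d=2, gamma=4.0, kappa=1.0)
print(target_only, with_source, weak_transfer)
```

With $n_P = 0$ the expression reduces to $n_Q^{(d+2)/(d+3)}$, the target-only rate. Whenever $\kappa n_P > 1$, a larger $\gamma$ shrinks the effective source sample size and pushes the bound back toward that target-only rate.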
Abstract
Dynamic pricing strategies are crucial for firms to maximize revenue by adjusting prices based on market conditions and customer characteristics. However, designing optimal pricing strategies becomes challenging when historical data are limited, as is often the case when launching new products or entering new markets. One promising approach to overcome this limitation is to leverage information from related products or markets to inform the focal pricing decisions. In this paper, we explore transfer learning for nonparametric contextual dynamic pricing under a covariate shift model, where the marginal distributions of covariates differ between source and target domains while the reward functions remain the same. We propose a novel Transfer Learning for Dynamic Pricing (TLDP) algorithm that can effectively leverage pre-collected data from a source domain to enhance pricing decisions in the target domain. The regret upper bound of TLDP is established under a simple Lipschitz condition on the reward function. To establish the optimality of TLDP, we further derive a matching minimax lower bound, which includes the target-only scenario as a special case and is presented for the first time in the literature. Extensive numerical experiments validate our approach, demonstrating its superiority over existing methods and highlighting its practical utility in real-world applications.
