Benchmarking for Deep Uplift Modeling in Online Marketing
Dugang Liu, Xing Tang, Yang Qiao, Miao Liu, Zexu Sun, Xiuqiang He, Zhong Ming
TL;DR
This work introduces DUMOM, the first open benchmark for deep uplift modeling, addressing reproducibility and fair comparison issues that have hindered progress in online marketing uplift estimation. By evaluating 13 representative DUM models on two real-world datasets (Criteo and Lazada) under four preprocessing settings, the study reveals that no single model consistently outperforms others across distributions, and that data preprocessing (feature normalization and instance deduplication) significantly shapes results. The benchmark provides detailed evaluation protocols, hyperparameter tuning, and implementation details to facilitate rapid, fair comparisons of new methods, while offering practitioners guidance on model selection and preprocessing in deployment. Overall, the paper highlights generalization limitations of current DUM approaches and calls for more robust, distribution-aware modeling and broader datasets to advance practical uplift modeling. The work aims to catalyze reproducible research and real-world impact in deep uplift modeling by making benchmarks and results openly available on GitHub.
Abstract
Online marketing is critical for many industrial platforms and business applications. It aims to increase user engagement and platform revenue by identifying the delivery-sensitive groups for specific incentives, such as coupons and bonuses. As the scale and complexity of features in industrial scenarios increase, deep uplift modeling (DUM) has emerged as a promising technique and attracted growing research attention from academia and industry, resulting in a variety of predictive models. However, DUM still lacks standardized benchmarks and unified evaluation protocols, which limits the reproducibility of experimental results in existing studies and the practical value and potential impact of this research direction. In this paper, we provide an open benchmark for DUM and present comparison results for existing models in a reproducible and uniform manner. To this end, we conduct extensive experiments on two representative industrial datasets under different preprocessing settings to re-evaluate 13 existing models. Surprisingly, our experimental results show that in many cases the most recent work differs less than expected from traditional work. Our experiments also reveal the limitations of DUM in generalization, especially across different preprocessing settings and test distributions. Our benchmarking work allows researchers not only to evaluate the performance of new models quickly but also to demonstrate fair comparisons with existing models. It also gives practitioners valuable insights into often-overlooked considerations when deploying DUM. We will make this benchmarking library, evaluation protocol, and experimental setup available on GitHub.
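
To make the preprocessing settings concrete, the sketch below enumerates four settings as the on/off combinations of feature normalization and instance deduplication, the two steps named in the TL;DR. It is an illustration only, not the benchmark's implementation: the z-score normalization, the exact-match deduplication key (features, treatment, outcome), the deduplicate-then-normalize order, and all names (`deduplicate`, `normalize`, `preprocess`, `SETTINGS`, and the row fields `x`, `t`, `y`) are assumptions introduced here.

```python
from itertools import product
from statistics import mean, pstdev


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence.

    Assumption: two instances are duplicates when features, treatment,
    and outcome all match.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (tuple(row["x"]), row["t"], row["y"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def normalize(rows):
    """Z-score each feature column; constant columns map to 0.0.

    Assumption: z-scoring stands in for whatever normalization the
    benchmark actually applies.
    """
    if not rows:
        return []
    columns = list(zip(*(row["x"] for row in rows)))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        {**row, "x": [(v - m) / s if s > 0 else 0.0
                      for v, (m, s) in zip(row["x"], stats)]}
        for row in rows
    ]


def preprocess(rows, *, dedup, norm):
    """Apply the selected steps (assumed order: deduplicate, then normalize)."""
    out = deduplicate(rows) if dedup else list(rows)
    return normalize(out) if norm else out


# The four settings: every on/off combination of the two steps.
SETTINGS = [{"dedup": d, "norm": n} for d, n in product([False, True], repeat=2)]

# Toy instances: x = features, t = treatment flag, y = outcome.
rows = [
    {"x": [1.0, 10.0], "t": 1, "y": 0},
    {"x": [1.0, 10.0], "t": 1, "y": 0},  # exact duplicate of the first row
    {"x": [3.0, 10.0], "t": 0, "y": 1},
]

results = {(s["dedup"], s["norm"]): preprocess(rows, **s) for s in SETTINGS}
for (dedup, norm), processed in results.items():
    print(f"dedup={dedup}, norm={norm}: {len(processed)} rows")
```

Running every model under each entry of `SETTINGS` with otherwise identical splits and tuning is one way to separate the effect of preprocessing from the effect of the model itself.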
