Benchmarking for Deep Uplift Modeling in Online Marketing

Dugang Liu; Xing Tang; Yang Qiao; Miao Liu; Zexu Sun; Xiuqiang He; Zhong Ming

TL;DR

This work introduces DUMOM, the first open benchmark for deep uplift modeling, addressing reproducibility and fair comparison issues that have hindered progress in online marketing uplift estimation. By evaluating 13 representative DUM models on two real-world datasets (Criteo and Lazada) under four preprocessing settings, the study reveals that no single model consistently outperforms others across distributions, and that data preprocessing (feature normalization and instance deduplication) significantly shapes results. The benchmark provides detailed evaluation protocols, hyperparameter tuning, and implementation details to facilitate rapid, fair comparisons of new methods, while offering practitioners guidance on model selection and preprocessing in deployment. Overall, the paper highlights generalization limitations of current DUM approaches and calls for more robust, distribution-aware modeling and broader datasets to advance practical uplift modeling. The work aims to catalyze reproducible research and real-world impact in deep uplift modeling by making benchmarks and results openly available on GitHub.

Abstract

Online marketing is critical for many industrial platforms and business applications. It aims to increase user engagement and platform revenue by identifying the user groups that are sensitive to the delivery of specific incentives, such as coupons and bonuses. As the scale and complexity of features in industrial scenarios increase, deep uplift modeling (DUM) has emerged as a promising technique and attracted growing research attention from academia and industry, resulting in various predictive models. However, DUM still lacks standardized benchmarks and unified evaluation protocols, which limits the reproducibility of experimental results in existing studies as well as the practical value and potential impact of this research direction. In this paper, we provide an open benchmark for DUM and present comparison results of existing models in a reproducible and uniform manner. To this end, we conduct extensive experiments on two representative industrial datasets with different preprocessing settings to re-evaluate 13 existing models. Surprisingly, our experimental results show that the most recent work differs less than expected from traditional work in many cases. In addition, our experiments reveal the limitations of DUM in generalization, especially across different preprocessing settings and test distributions. Our benchmark allows researchers to evaluate new models quickly and to compare them fairly with existing models. It also gives practitioners valuable insights into often overlooked considerations when deploying DUM. We will make this benchmarking library, evaluation protocol, and experimental setup available on GitHub.

Paper Structure (36 sections, 6 equations, 5 figures, 7 tables)


Introduction
Deep Uplift Modeling
Architecture Overview
Objective of Uplift Modeling
Response Modeling of the Control Group
Response Modeling of the Treatment Group
Predictive Modeling of Treatment Indicator Variables
Loss Function
Inference Stage
Representative Models
Treatment as a Branch Switch
Treatments as Model Features
Overview of Representative Models
The Connection between DUM and ITE
Open DUM Benchmark (DUMOM)
...and 21 more sections

Figures (5)

Figure 1: The architectural illustration of representative baselines in neural network-based uplift modeling. For more information on the meaning of some baseline-specific symbols, please refer to their original papers.
Figure 2: The impact of whether to perform instance deduplication (i.e., "w/ ID" or "w/o ID") on two benchmark datasets when feature normalization is not performed (i.e., "w/o FN").
Figure 3: The impact of whether to perform instance deduplication (i.e., "w/ ID" or "w/o ID") on two benchmark datasets when feature normalization is performed (i.e., "w/ FN").
Figure 4: The impact of whether to perform feature normalization (i.e., "w/ FN" or "w/o FN") on two benchmark datasets when instance deduplication is not performed (i.e., "w/o ID").
Figure 5: The impact of whether to perform feature normalization (i.e., "w/ FN" or "w/o FN") on two benchmark datasets when instance deduplication is performed (i.e., "w/ ID").
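
The four preprocessing settings compared in Figures 2 to 5 form a 2 x 2 grid over feature normalization (FN) and instance deduplication (ID). The following is a minimal sketch of how such a grid might be built, not the benchmark's actual code. It assumes that ID means dropping exact duplicate rows, that FN means z-scoring the feature columns, and that deduplication runs before normalization; the function names and the toy rows are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence (assumed meaning of ID)."""
    return list(dict.fromkeys(rows))


def normalize(rows, n_features):
    """Z-score the first n_features columns (assumed meaning of FN); other columns are untouched."""
    columns = list(zip(*rows))
    scaled = []
    for idx, col in enumerate(columns):
        if idx < n_features:
            mu, sigma = mean(col), pstdev(col)
            # Constant columns map to 0.0 to avoid dividing by zero.
            col = tuple((v - mu) / sigma if sigma > 0 else 0.0 for v in col)
        scaled.append(col)
    return [tuple(r) for r in zip(*scaled)]


def make_settings(rows, n_features):
    """Return the 2 x 2 grid of preprocessed datasets keyed by (use_fn, use_id)."""
    settings = {}
    for use_fn, use_id in product([False, True], repeat=2):
        data = deduplicate(rows) if use_id else list(rows)
        if use_fn:
            data = normalize(data, n_features)
        settings[(use_fn, use_id)] = data
    return settings


# Toy rows: (feature_1, feature_2, treatment, outcome); the values are made up.
toy = [
    (1.0, 10.0, 1, 1),
    (1.0, 10.0, 1, 1),  # exact duplicate of the first row
    (3.0, 30.0, 0, 0),
    (5.0, 20.0, 1, 0),
]
grid = make_settings(toy, n_features=2)
```

Each entry of `grid` could then be passed to the same training and evaluation routine, so that differences between entries reflect preprocessing alone rather than model or tuning changes.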
