The Effects of Data Split Strategies on the Offline Experiments for CTR Prediction
Ramazan Tarik Turksoy, Beyza Turkmen
TL;DR
The paper investigates how offline evaluation practices for CTR prediction influence model selection, showing that random data splits can cause data leakage and over-optimistic performance estimates compared with temporal splits, which reflect real-world, time-evolving data. By benchmarking 12 deep CTR models on the Criteo dataset under both split strategies, it demonstrates significant differences in model rankings and reveals a weak correspondence between offline rankings and prospective online performance. The results highlight the presence of concept drift and the limitations of traditional random splits, arguing for temporal splits as a more realistic evaluation protocol. The work contributes to more trustworthy model selection in CTR tasks and points toward integrating temporal considerations into offline evaluation, followed by online validation to confirm practical benefits.
Abstract
Click-through rate (CTR) prediction is a crucial task in online advertising, used to recommend products that users are likely to be interested in. Identifying the best-performing models requires rigorous model evaluation. Although online experimentation such as A/B testing is valuable, it carries its own limitations and risks, so offline experimentation plays a significant role in selecting models before they face live user-item interactions. However, the correlation between offline performance metrics and actual online model performance is often inadequate. One main reason for this discrepancy is the common practice of using random splits to create training, validation, and test datasets in CTR prediction, whereas real-world CTR prediction follows a temporal order. Therefore, the methodology used in offline evaluation, particularly the data splitting strategy, is crucial. This study aims to address the inconsistency between current offline evaluation methods and real-world use cases by focusing on data splitting strategies. To examine the impact of different data split strategies on offline performance, we conduct extensive experiments using both random and temporal splits on a large open benchmark dataset, Criteo.
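The two splitting strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: the column names (`timestamp`, `click`) and split fraction are assumed for the example, and the toy DataFrame stands in for a real impression log such as Criteo.

```python
import numpy as np
import pandas as pd

def random_split(df, test_frac=0.2, seed=42):
    """Shuffle rows regardless of time: future events can leak into training."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    return shuffled.iloc[:-n_test], shuffled.iloc[-n_test:]

def temporal_split(df, time_col="timestamp", test_frac=0.2):
    """Order by time and hold out the most recent events, mirroring deployment."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n_test = int(len(ordered) * test_frac)
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Toy impression log: 100 synthetic events with increasing timestamps.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": np.arange(100),
    "click": rng.integers(0, 2, size=100),
})

train_r, test_r = random_split(df)
train_t, test_t = temporal_split(df)

# Temporal split: every test timestamp is strictly later than every train
# timestamp, so the model is always evaluated on "future" data.
assert test_t["timestamp"].min() > train_t["timestamp"].max()

# Random split: test rows are scattered across the whole timeline, so the
# training set contains events that occur after some test events (leakage).
assert test_r["timestamp"].min() < train_r["timestamp"].max()
```

Under the random split, the model sees training examples drawn from the same time span as the test set, which inflates offline metrics relative to the deployment setting that the temporal split emulates.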
