Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators
Bruno Fava
TL;DR
The paper addresses the problem of evaluating a model's properties on the same data used to train it. It introduces an inference framework that averages over multiple split-sample procedures (training on part of the data and evaluating on the rest), so that the entire sample is used for evaluation. It proves a new central limit theorem for split-sample Z-estimators under mild conditions, with no restrictions on model complexity or convergence rates, and derives valid confidence intervals and a reproducibility measure for p-values. The work also develops a specialized inference approach for model comparisons that accounts for dependence across splits. These methods are applied to poverty forecasting in Ghana and to learning heterogeneous treatment effects (HTE) in randomized experiments, showing improved power and reproducibility. The paper further introduces a novel ensemble method for GATES-style HTE analysis, which combines multiple ML predictors to enhance power while using the full sample for evaluation. The results offer actionable guidance on choosing the number of folds and repetitions and provide practical tools for policy-relevant decision-making with data-dependent, high-dimensional models.
Abstract
As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate subsamples to estimate the model and to evaluate it. However, this approach has two drawbacks: each task uses only part of the data, and different splits can lead to widely different estimates. Averaging across multiple splits, I develop an inference approach that uses more data for training, uses the entire sample for testing, and improves reproducibility. I address the statistical dependence from reusing observations across splits by proving a new central limit theorem for a large class of split-sample estimators under arguably mild and general conditions. Importantly, I make no restrictions on model complexity or convergence rates. I show that confidence intervals based on the normal approximation are valid for many applications, but may undercover in important cases of interest, such as comparing the performance of two models. I develop a new inference approach for such cases, explicitly accounting for the dependence across splits. Moreover, I provide a measure of reproducibility for p-values obtained from split-sample estimators. Finally, I apply my results to two important problems in development and public economics: predicting poverty and learning heterogeneous treatment effects in randomized experiments. I show that my inference approach with repeated cross-fitting achieves better power than existing alternatives, often enough to reveal statistical significance that would otherwise be missed.
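To make the idea of averaging across multiple splits concrete, the following is a minimal sketch of a repeated cross-fitting point estimate on simulated data. It is an illustration under stated assumptions, not the paper's procedure: the function name `repeated_cross_fit_mse`, the use of ordinary least squares as the trained model, the choice of mean squared error as the model property, and the simulated data are all hypothetical. The sketch computes only the averaged point estimate; it does not implement the paper's central limit theorem, confidence intervals, or reproducibility measure.

```python
import numpy as np


def repeated_cross_fit_mse(X, y, n_folds=5, n_repeats=20, seed=0):
    """Average out-of-fold squared error over repeated K-fold splits.

    Hypothetical helper for illustration. Returns the averaged point
    estimate and the array of per-repetition estimates.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    per_repeat = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh random partition of the sample into folds.
        folds = np.array_split(rng.permutation(n), n_folds)
        sq_err = np.empty(n)
        for test_idx in folds:
            train_mask = np.ones(n, dtype=bool)
            train_mask[test_idx] = False
            # Train on the other folds only (OLS is a stand-in for any model).
            beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
            # Evaluate on the held-out fold, so each observation is tested
            # exactly once per repetition.
            sq_err[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
        per_repeat[r] = sq_err.mean()
    # Averaging over repetitions reduces dependence on any single split.
    return per_repeat.mean(), per_repeat


# Simulated data, purely illustrative.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.normal(size=200)

estimate, per_repeat = repeated_cross_fit_mse(X, y)
print(f"averaged estimate: {estimate:.3f}")
print(f"range across single repetitions: {per_repeat.min():.3f} to {per_repeat.max():.3f}")
```

The spread of the single-repetition estimates shows how much a result can depend on the particular split, which is the reproducibility concern the abstract raises. Because the same observations are reused across splits, a standard error computed as if the repetitions were independent would not be justified; handling that dependence is the subject of the paper's theory.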
