Era Splitting: Invariant Learning for Decision Trees
Timothy DeLise
TL;DR
The paper tackles out-of-distribution generalization in regression tasks by introducing era-aware split criteria for gradient boosted decision trees. Era Splitting computes per-era impurity reductions and aggregates them with a Boltzmann operator to encourage invariance across eras, while Directional Era Splitting augments this with cross-era agreement of split directions. The authors validate the approach on four datasets (Shifted Sine Wave, Synthetic Memorization, Camelyon17, and Numerai), showing reduced in-sample overfitting and improved out-of-sample performance, with Directional Era Splitting often delivering the best results. The publicly released code lets practitioners apply invariant learning concepts to tree ensembles on tabular data with clear domain shifts.
Abstract
Real-life machine learning problems exhibit distributional shifts in the data from one time to another or from one place to another. This behavior is beyond the scope of the traditional empirical risk minimization paradigm, which assumes that the data are i.i.d. over time and across locations. The emerging field of out-of-distribution (OOD) generalization addresses this reality with new theory and algorithms that incorporate "environmental", or "era-wise", information into the algorithms. So far, most research has focused on linear models and/or neural networks. In this research we develop two new splitting criteria for decision trees, which allow us to apply ideas from OOD generalization research to decision tree models, namely gradient boosting decision trees (GBDTs). The new splitting criteria use era-wise information associated with the data to grow tree-based models that are optimal across all disjoint eras in the data, instead of optimal over the entire data set pooled together, which is the default setting. In this paper, the two new splitting criteria are defined and analyzed theoretically. Their effectiveness is tested in four experiments, ranging from simple synthetic problems to complex real-world applications. In particular, we cast the OOD domain-adaptation problem in the context of financial markets, where the new models outperform state-of-the-art GBDT models on the Numerai data set. The new criteria are incorporated into the Scikit-Learn code base and made freely available online.
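To make the contrast between a pooled split score and an era-wise split score concrete, the following is a minimal sketch in NumPy. It is not the authors' Scikit-Learn implementation: the function names (`boltzmann`, `variance_reduction`, `era_split_score`) are hypothetical, variance reduction is assumed as the impurity measure, and the convention that a negative `alpha` pushes the aggregate toward the worst-performing era is an assumption of this sketch. The Boltzmann operator itself is the standard softmax-weighted average, which equals the mean at `alpha = 0` and approaches the minimum or maximum as `alpha` goes to negative or positive infinity.

```python
import numpy as np


def boltzmann(values, alpha):
    """Boltzmann operator: softmax-weighted average of `values`.

    alpha = 0 gives the mean; alpha -> -inf approaches the minimum;
    alpha -> +inf approaches the maximum.
    """
    v = np.asarray(values, dtype=float)
    z = alpha * v
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return float(np.sum(v * w) / np.sum(w))


def variance_reduction(x, y, threshold):
    """Variance reduction obtained by splitting y on x <= threshold."""
    left = x <= threshold
    n = len(y)
    n_left = int(left.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # degenerate split, no gain
    return float(
        np.var(y)
        - (n_left / n) * np.var(y[left])
        - (n_right / n) * np.var(y[~left])
    )


def era_split_score(x, y, eras, threshold, alpha=-1.0):
    """Score a split by aggregating per-era gains with the Boltzmann operator.

    Hypothetical helper: each era's gain is computed separately, then combined.
    """
    gains = [
        variance_reduction(x[eras == e], y[eras == e], threshold)
        for e in np.unique(eras)
    ]
    return boltzmann(gains, alpha)


# Toy data: the feature separates the target in era 0 but not in era 1.
x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1], dtype=float)
eras = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pooled_gain = variance_reduction(x, y, threshold=1.5)
mean_gain = era_split_score(x, y, eras, threshold=1.5, alpha=0.0)
worst_case_gain = era_split_score(x, y, eras, threshold=1.5, alpha=-1000.0)
```

In this toy setting the pooled score is positive because one era dominates the signal, while a strongly negative `alpha` drives the era-wise score toward zero, since the split yields no gain in the second era. A tree grown with such a criterion would therefore prefer splits that help in every era over splits that help only in some.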
