Era Splitting: Invariant Learning for Decision Trees

Timothy DeLise

TL;DR

The paper tackles out-of-distribution generalization in regression tasks by introducing era-aware split criteria for gradient boosted decision trees. Era Splitting computes per-era impurity reductions and aggregates them with a Boltzmann operator to encourage invariance across eras, while Directional Era Splitting augments this with cross-era agreement of split directions. The authors validate the approach on four datasets—Shifted Sine Wave, Synthetic Memorization, Camelyon17, and Numerai—showing reduced in-sample overfitting and improved out-of-sample performance, with directional era splitting often delivering the best results. Public code releases enable practitioners to apply invariant learning concepts to tree ensembles in settings with clear domain shifts in tabular data.

Abstract

Real-life machine learning problems exhibit distributional shifts in the data from one time to another or from one place to another. This behavior is beyond the scope of the traditional empirical risk minimization paradigm, which assumes that the data are i.i.d. over time and across locations. The emerging field of out-of-distribution (OOD) generalization addresses this reality with new theory and algorithms that incorporate "environmental", or "era-wise", information into the algorithms. So far, most research has focused on linear models and/or neural networks. In this research we develop two new splitting criteria for decision trees, which allow us to apply ideas from OOD generalization research to decision tree models, namely, gradient boosting decision trees (GBDTs). The new splitting criteria use era-wise information associated with the data to grow tree-based models that are optimal across all disjoint eras in the data, instead of optimal over the entire data set pooled together, which is the default setting. In this paper, two new splitting criteria are defined and analyzed theoretically. Effectiveness is tested in four experiments, ranging from simple synthetic problems to complex real-world applications. In particular, we cast the OOD domain-adaptation problem in the context of financial markets, where the new models outperform state-of-the-art GBDT models on the Numerai data set. The new criteria are incorporated into the Scikit-Learn code base and made freely available online.

Paper Structure

This paper contains 17 sections, 5 theorems, 41 equations, 9 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

The value of $y(t)$ that minimizes $R(d)$ is the average of $y_i$ over all cases $(x_i, y_i)$ falling into $t$; that is, the minimizing $y(t)$ is $\bar{y}(t) = \frac{1}{N(t)} \sum_{x_i \in t} y_i$, where the sum is over all $y_i$ such that $x_i \in t$ and $N(t)$ is the total number of cases in $t$.
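
Proposition 1 can be checked numerically. The following is a minimal sketch, assuming squared error as the node loss; the target values and the helper name `node_sse` are made up for illustration and do not come from the paper.

```python
import numpy as np

def node_sse(y, value):
    """Sum of squared errors when every case in the node is predicted as `value`."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - value) ** 2))

# Hypothetical targets of the cases falling into one node t.
y_node = np.array([1.0, 2.0, 4.0, 9.0])
y_bar = float(y_node.mean())  # candidate minimizer from Proposition 1

# Compare the node mean against a grid of alternative node values.
grid = np.linspace(y_node.min(), y_node.max(), 801)
errors = np.array([node_sse(y_node, v) for v in grid])
best_on_grid = float(grid[np.argmin(errors)])

print("node mean:", y_bar)
print("best value on grid:", best_on_grid)
print("SSE at the mean:", node_sse(y_node, y_bar))
```

No grid value gives a lower squared error than the node mean, which is what the proposition states.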

Figures (9)

  • Figure 1: Financial data experience distribution shifts over time due to changing macro-economic variables and current events. OOD algorithms endeavor to uncover signals which are present in all eras (environments) instead of spurious signals only present in some.
  • Figure 2: Figure 8.2 of Breiman et al. (1984), an example tree.
  • Figure 3: Degenerate split decisions induced by the traditional splitting criterion when viewed from an OOD setting. Each plot displays example data with 2 features from two eras (environments). The y-axis plots the gradients corresponding to each data point. On the left, the original splitting criterion chooses a split which doesn't improve impurity in any era, while the era splitting criterion chooses a split that improves impurity in both eras. On the right, the original criterion chooses a split which results in conflicting directions, while directional era splitting chooses a split resulting in consistent directions in each era.
  • Figure 4: A schematic example of splitting data at each tree node. The target values are stored inside each data point. The original setting pools all the data together. Era splitting computes split scores on a per-era basis. The values of the child nodes, the directions of the splits, and the original and era split criteria scores are indicated. Notice that era splitting does not choose the same split as the original criterion, since that split wouldn't improve impurity in any era.
  • Figure 5: A visual description of the data generation process for the shifted sine wave data set. Each era of training data starts with a sine wave (green), adds a random vertical shift and a random blur (Gaussian noise).
  • ...and 4 more figures
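
The situations described in the captions of Figures 3 and 4 can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not the paper's implementation: impurity is plain squared error around the node mean (rather than the gradient-based gain used in GBDT libraries), "direction" is read as the sign of the difference between child means, and all data values, feature names, and helper functions are hypothetical.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (0 for an empty set)."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def split_gain(x, y, threshold):
    """Impurity reduction from splitting at `threshold` (left child: x <= threshold)."""
    left = x <= threshold
    return sse(y) - sse(y[left]) - sse(y[~left])

def split_direction(x, y, threshold):
    """Sign of (right mean - left mean); 0 if either child is empty."""
    left = x <= threshold
    if left.all() or (~left).all():
        return 0
    return int(np.sign(y[~left].mean() - y[left].mean()))

# Toy data: two eras of four cases each (all values are made up).
era = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([-1.5, -1.5, -0.5, -0.5, 0.5, 0.5, 1.5, 1.5])
features = {
    # Separates the eras but carries no signal inside either era.
    "A_era_proxy": np.array([-2.0, -1.0, -2.0, -1.0, 1.0, 2.0, 1.0, 2.0]),
    # Same relationship to the target in both eras.
    "B_invariant": np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]),
    # Relationship to the target flips sign between eras.
    "C_flipping": np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0]),
}
threshold = 0.0

report = {}
for name, x in features.items():
    pooled = split_gain(x, y, threshold)
    per_era = [split_gain(x[era == e], y[era == e], threshold) for e in (0, 1)]
    dirs = [split_direction(x[era == e], y[era == e], threshold) for e in (0, 1)]
    report[name] = {"pooled": pooled, "per_era": per_era, "directions": dirs}
    print(name, report[name])
```

On this data the pooled gain prefers feature A, even though A improves impurity in neither era. Per-era gains prefer B or C equally, and only the direction check separates B (same sign in both eras) from C (opposite signs).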

Theorems & Definitions (17)

  • Definition 1: Mean Squared Error (Breiman et al., 1984)
  • Proposition 1: Proposition 8.10 (Breiman et al., 1984)
  • Definition 2: Definition 8.13 (Breiman et al., 1984)
  • Definition 3: Original Split Criterion (XGBoost)
  • Definition 4: Era (Environment) (Peters et al., 2015)
  • Definition 5: The Boltzmann Operator
  • Proposition 2: Limit of the Boltzmann Operator
  • proof
  • Definition 6: Era Split Criterion
  • Definition 7
  • ...and 7 more
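
For readers unfamiliar with the Boltzmann operator named in Definition 5 and Proposition 2, the following minimal sketch uses its standard form, $B_\alpha(x) = \sum_i x_i e^{\alpha x_i} / \sum_i e^{\alpha x_i}$, and assumes the paper's definition matches it. The per-era gain values and the function name `boltzmann` are hypothetical.

```python
import numpy as np

def boltzmann(values, alpha):
    """Boltzmann operator: average of `values` weighted by softmax(alpha * values)."""
    v = np.asarray(values, dtype=float)
    z = alpha * v
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return float(np.sum(v * w) / np.sum(w))

per_era_gains = [1.0, 0.2, 0.7]  # hypothetical per-era split gains

for alpha in (-50.0, 0.0, 50.0):
    print(f"alpha={alpha:+.0f}: {boltzmann(per_era_gains, alpha):.6f}")
```

With a strongly negative `alpha` the result approaches the smallest per-era gain, with `alpha = 0` it is the arithmetic mean, and with a strongly positive `alpha` it approaches the largest gain. A negative `alpha` therefore scores a split by its worst era, which is one way to favor splits that help in every era.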