Jacobian Aligned Random Forests

Sarwesh Rauniyar

TL;DR

JARF introduces a single global preconditioner, built from the expected Jacobian outer product (EJOP, which generalizes the expected gradient outer product, EGOP), that rotates and scales input features to align predictive directions with axis-aligned splits. By applying this preconditioning before standard random forests or boosted trees, JARF captures oblique decision boundaries with minimal changes to existing training pipelines. The approach is supported by theory linking EJOP to CART impurity gains and by empirical results showing competitive or superior performance to oblique forests across classification and regression benchmarks, with favorable training times. The work highlights supervised, gradient-based geometry as a robust, model-agnostic means to enhance tabular tree ensembles while preserving their speed and robustness.

Abstract

Axis-aligned decision trees are fast and stable but struggle on datasets with rotated or interaction-dependent decision boundaries, where informative splits require linear combinations of features rather than single-feature thresholds. Oblique forests address this with per-node hyperplane splits, but at added computational cost and implementation complexity. We propose a simple alternative: JARF, Jacobian-Aligned Random Forests. Concretely, we first fit an axis-aligned forest to estimate class probabilities or regression outputs, compute finite-difference gradients of these predictions with respect to each feature, aggregate them into an expected Jacobian outer product that generalizes the expected gradient outer product (EGOP), and use it as a single global linear preconditioner for all inputs. This supervised preconditioner applies a single global linear transformation (a rotation and rescaling) of the feature space, then hands the transformed data back to a standard axis-aligned forest, preserving off-the-shelf training pipelines while capturing oblique boundaries and feature interactions that would otherwise require many axis-aligned splits to approximate. The same construction applies to any model that provides gradients, though we focus on random forests and gradient-boosted trees in this work. On tabular classification and regression benchmarks, this preconditioning consistently improves axis-aligned forests and often matches or surpasses oblique baselines while training faster than they do. Our experimental results and theoretical analysis together indicate that supervised preconditioning can recover much of the accuracy of oblique forests while retaining the simplicity and robustness of axis-aligned trees.
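The pipeline described in the abstract (fit a model, take finite-difference gradients, average their outer products, transform the inputs) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `ejop` and `preconditioner`, the `predict_fn` interface, the step size `eps`, and the toy logistic model standing in for a fitted forest are all assumptions. In particular, this excerpt does not say how the EJOP matrix is turned into the preconditioner $H$, so the sketch simply uses the trace-normalized EJOP itself.

```python
import numpy as np

def ejop(predict_fn, X, eps=1e-2):
    """Centered finite-difference estimate of the expected Jacobian outer product.

    predict_fn (hypothetical interface): maps an (n, d) array to (n, C) outputs,
    e.g. class probabilities from a fitted forest.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    C = np.asarray(predict_fn(X)).reshape(n, -1).shape[1]
    jac = np.empty((n, C, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        up = np.asarray(predict_fn(X + step)).reshape(n, -1)
        down = np.asarray(predict_fn(X - step)).reshape(n, -1)
        jac[:, :, j] = (up - down) / (2.0 * eps)
    # Average J(x)^T J(x) over samples: a symmetric PSD (d, d) matrix.
    return np.einsum("ncj,nck->jk", jac, jac) / n

def preconditioner(G):
    # Assumption: use the trace-normalized EJOP directly as H.
    tr = np.trace(G)
    return G / tr if tr > 0 else np.eye(G.shape[0])

# Toy stand-in for a fitted model: a two-class logistic boundary along w.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

def toy_predict(Z):
    p = 1.0 / (1.0 + np.exp(-4.0 * (Z @ w)))
    return np.column_stack([1.0 - p, p])

G = ejop(toy_predict, X)
H = preconditioner(G)
X_tilde = X @ H                      # transformed inputs for the second fit
top = np.linalg.eigh(G)[1][:, -1]    # leading EJOP direction
```

In this toy case the leading EJOP eigenvector recovers the oblique direction `w`. In an actual pipeline one would presumably pass something like a fitted forest's probability predictor as `predict_fn` and then refit a standard axis-aligned forest on `X_tilde`. Forest predictions are piecewise constant, so the choice of `eps` matters much more there than in this smooth example.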

Paper Structure

This paper contains 64 sections, 7 theorems, 73 equations, 4 figures, 11 tables.

Key Result

Lemma 1

Fix a class $c$. Let $f_c : \mathbb{R}^d \to \mathbb{R}$ be $C^3$ in a neighborhood of $x$, and assume all third directional derivatives along the coordinate axes are bounded there: Define the centered finite-difference (FD) of the $j$th partial derivative at $x$ with step size $\varepsilon > 0$ by Then for each coordinate $j$, Consequently, if $\|\nabla f_c(x)\|_2\le M$ and $G^{\mathrm{FD}}(c)
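The displayed equations of the lemma are not reproduced in this excerpt. For orientation, the standard Taylor argument behind a centered finite-difference bound of this kind runs as below. The symbols $B$ (the bound on the third coordinate derivatives), $D_j^{\varepsilon}$ (the FD operator), and $\xi_{\pm}$ are notation assumed here, and the paper's exact statement and constants may differ.

```latex
\begin{align*}
D_j^{\varepsilon} f_c(x)
  &:= \frac{f_c(x + \varepsilon e_j) - f_c(x - \varepsilon e_j)}{2\varepsilon}, \\
f_c(x \pm \varepsilon e_j)
  &= f_c(x) \pm \varepsilon\,\partial_j f_c(x)
     + \frac{\varepsilon^2}{2}\,\partial_j^2 f_c(x)
     \pm \frac{\varepsilon^3}{6}\,\partial_j^3 f_c(\xi_{\pm}), \\
D_j^{\varepsilon} f_c(x) - \partial_j f_c(x)
  &= \frac{\varepsilon^2}{12}\left(\partial_j^3 f_c(\xi_{+}) + \partial_j^3 f_c(\xi_{-})\right), \\
\left| D_j^{\varepsilon} f_c(x) - \partial_j f_c(x) \right|
  &\le \frac{B\,\varepsilon^2}{6}
  \quad \text{whenever } \left|\partial_j^3 f_c\right| \le B \text{ near } x .
\end{align*}
```

Here $\xi_{\pm}$ are points on the segments between $x$ and $x \pm \varepsilon e_j$. The even-order terms cancel in the difference, which is why the error is of order $\varepsilon^2$ rather than $\varepsilon$.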

Figures (4)

  • Figure 1: Cohen’s $\kappa$ versus rotation angle $\theta$ for RF, RotF, CCF, SPORF, JARF, XGB, and the PCA+RF and LDA+RF baselines on the simulated rotated hyperplane problem. JARF attains the highest $\kappa$ at moderate and large rotations, while PCA+RF and LDA+RF offer only modest gains over RF, and all axis-aligned methods (RF, XGB, PCA+RF, LDA+RF) degrade more quickly than the oblique forests as $\theta$ increases.
  • Figure 2: Beeswarm of effect size relative to RF on real data. Each marker is one dataset in the 15-task suite. The vertical axis shows the per-dataset effect size $\Delta(A)=\kappa(\mathrm{RF})-\kappa(A)$; the dashed line marks parity with RF ($\Delta{=}0$). Points below the line indicate the method outperforms RF. JARF produces mostly negative deltas and achieves the best overall rank in Table \ref{tab:real-main}, while oblique baselines (RotF, CCF, SPORF) show mixed but generally favorable improvements over RF.
  • Figure 3: Comparison of median training times on the 20 real-data tasks. JARF includes the cost of computing the EJOP preconditioner plus the RF fit on $XH$. Measured times: RF = 15 s, JARF = 25 s, RotF = 60 s, CCF = 44 s, SPORF = 45 s, XGB = 43 s. JARF adds $\sim$10 s over RF ($\approx\!1.67\times$ RF cost) yet remains faster than per-node oblique forests.
  • Figure 4: Alignment growth with EJOP subspace size. Median $s_k = \|U_k^\top \tilde{n}\|_2^2$ versus $k$ for RotF/CCF/SPORF. Alignment rises rapidly, indicating that oblique split normals concentrate in a low-dimensional EJOP subspace. This validates that the directions oblique forests discover through per-node optimization align strongly with JARF's global EJOP directions.
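The alignment score $s_k = \|U_k^\top \tilde{n}\|_2^2$ from Figure 4 can be computed as in the sketch below. It assumes that $U_k$ holds the top-$k$ eigenvectors of the EJOP matrix and that $\tilde{n}$ is a split normal scaled to unit length. Neither definition is spelled out in this excerpt, and the function name `alignment_scores` is hypothetical.

```python
import numpy as np

def alignment_scores(G, normal):
    """Return s_k for k = 1..d: squared norm of the unit normal projected
    onto the span of the top-k eigenvectors of the symmetric matrix G."""
    n_tilde = np.asarray(normal, dtype=float)
    n_tilde = n_tilde / np.linalg.norm(n_tilde)
    _, vecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    U = vecs[:, ::-1]                # reorder columns: largest eigenvalue first
    proj_sq = (U.T @ n_tilde) ** 2   # squared coordinate along each eigenvector
    return np.cumsum(proj_sq)        # s_k accumulates the first k coordinates

# Toy example: EJOP-like matrix with most energy along the first axis.
G_toy = np.diag([5.0, 1.0, 0.1])
scores = alignment_scores(G_toy, normal=[1.0, 1.0, 0.0])
```

Because the eigenvectors form an orthonormal basis, $s_k$ is non-decreasing in $k$ and reaches 1 at $k = d$. A curve that rises quickly, as the caption describes, means the split normal lies mostly in the span of the first few EJOP directions.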

Theorems & Definitions (13)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1: Dimension-adapted risk bound for JARF
  • Lemma 3
  • proof
  • Lemma 4: Subspace perturbation
  • proof
  • Lemma 5: Projection error
  • ...and 3 more