Orthogonal Gradient Boosting for Simpler Additive Rule Ensembles

Fan Yang, Pierre Le Bodic, Michael Kamp, Mario Boley

TL;DR

This work tackles interpretability in additive rule ensembles by introducing Orthogonal Gradient Boosting (OGB), which uses an objective that measures the angle between the risk gradient and the orthogonal component of a candidate rule's output to guide corrective updates. The method formalizes a corrective gradient descent within the span of previously selected rules and the gradient, and defines the orthogonal objective $\mathrm{obj}_{\mathrm{ogb}}(q) = \frac{|\mathbf{g}_\perp^T \mathbf{q}|}{\|\mathbf{q}_\perp\| + \epsilon}$ to promote more general, shorter rules while maintaining predictive accuracy. The authors develop efficient incremental algorithms to compute the objective and enable prefix-search strategies, including beam and branch-and-bound variants. Empirically, OGB outperforms standard gradient boosting, gradient sum, XGBoost, and SIRUS across 34 datasets (classification, regression, Poisson) with complexity levels up to 50, achieving better risk/complexity trade-offs while keeping computation practical. The work also discusses extensions to extreme gradient boosting and the theoretical guarantees underpinning the orthogonal approach.
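The orthogonal objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `project_out` and `obj_ogb`, the use of a least-squares solve for the projection, and the default value of `eps` are all assumptions made for illustration. `Q` holds the output vectors of the already selected rules as columns, `g` is the risk gradient, and `q` is the output vector of a candidate rule.

```python
import numpy as np

def project_out(Q, v):
    """Return the part of v orthogonal to the column space of Q (hypothetical helper)."""
    if Q.shape[1] == 0:
        return v.copy()  # nothing selected yet, nothing to project out
    coef, *_ = np.linalg.lstsq(Q, v, rcond=None)  # least-squares fit of v by the columns of Q
    return v - Q @ coef  # the residual is the orthogonal component

def obj_ogb(q, g, Q, eps=1e-8):
    """|g_perp^T q| / (||q_perp|| + eps), following the formula in the TL;DR."""
    g_perp = project_out(Q, g)
    q_perp = project_out(Q, q)
    return abs(g_perp @ q) / (np.linalg.norm(q_perp) + eps)

# Toy data (made up for this sketch): one rule covering the first two of four points.
Q = np.array([[1.0], [1.0], [0.0], [0.0]])
g = np.array([1.0, -3.0, 2.0, 2.0])

redundant = np.array([1.0, 1.0, 0.0, 0.0])  # same output as the selected rule
fresh = np.array([0.0, 0.0, 1.0, 1.0])      # covers the two remaining points

print("redundant candidate:", obj_ogb(redundant, g, Q))
print("fresh candidate:    ", obj_ogb(fresh, g, Q))
```

A candidate whose output already lies in the span of the selected rules scores essentially zero, because the gradient component orthogonal to that span has no overlap with it. This is the property that lets the objective anticipate the weight refitting of corrective boosting.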

Abstract

Gradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires limiting the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximates the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.

Paper Structure

This paper contains 17 sections, 8 theorems, 39 equations, 11 figures, 5 tables, 3 algorithms.

Key Result

Proposition 0

Let $\mathbf{f}^\mathrm{CGD}=\mathop{\mathrm{arg\,min}}\{R_\lambda(\mathbf{f}') \,:\, \mathbf{f}' \in \mathrm{range}\, [\mathbf{Q}_{t-1}; \mathbf{g}]\}$ be the output vector of the ideal corrective gradient descent update in round $t$, and for $\mathbf{v} \in \mathbb{R}^n$ denote by $\mathbf{v}_\perp$ its projection onto the orthogonal complement of $\mathrm{range} \, \mathbf{Q}_{t-1}$ ...

Figures (11)

  • Figure 1: A regression example with three data points with target values $\mathbf{y}=(-10, -6, 5)$ and three queries with outputs $\mathbf{q}_1=(1, 1, 0)$, i.e., $q_1$ selects the first two data points, $\mathbf{q}_2=(0, 0, 1)$, and $\mathbf{q}_3=(0, 1, 1)$. Gradient boosting selects $q_1$ with weight $\beta_1=-8$ as first rule, resulting in negative gradient $-\mathbf{g}=(-2, 2, 5)$. Left: The input space and the rule ensembles generated by CGB/GB and CGD. The CGB method generates output $(-8, -8, 5)$, and CGD generates output $(-10.3, -5.6, 4.7)$. Middle: Approximations to the target subspace (blue) spanned by $\mathbf{q}_1$ and $-\mathbf{g}$. The subspace (green) spanned by $\mathbf{q}_3$ and $\mathbf{q}_1$ is a better approximation than the subspace (orange) selected by standard gradient boosting (spanned by $\mathbf{q}_2$ and $\mathbf{q}_1$). Right: After projection onto the orthogonal complement of the already selected query, the angle between $\mathbf{q}_3$ and $-\mathbf{g}$ is smaller than that between $\mathbf{q}_2$ and $-\mathbf{g}$, so $q_3$ is selected by the orthogonal gradient boosting objective.
  • Figure 2: Method log risks across complexity levels (top) and log test risks across datasets (bottom).
  • Figure 3: Comparison of risks with different complexity levels between Stepwise Gradient Boosting (SGB), Corrective Gradient Boosting (CGB), COB using Greedy search and COB using Branch-and-bound search. The colours represent the complexity of the rule ensembles.
  • Figure 4: Coverage rate of the rules generated by Gradient Boosting, XGBoost, Gradient Sum versus OGB.
  • Figure 5: Running time ratio of SXB and COB (top) and naive and efficient opt. of COB (bottom).
  • ...and 6 more figures
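
The selection described in the Figure 1 caption can be checked numerically. The sketch below is illustrative only and rests on several assumptions that are not stated in this listing: squared loss without regularization (so the gradient is $\mathbf{f} - \mathbf{y}$), a standard gradient boosting objective of the form $|\mathbf{g}^T \mathbf{q}| / \|\mathbf{q}\|$, and the helper names `perp`, `obj_gb`, and `obj_ogb`, which are hypothetical.

```python
import numpy as np

# Data from the Figure 1 caption.
y = np.array([-10.0, -6.0, 5.0])
q1 = np.array([1.0, 1.0, 0.0])
q2 = np.array([0.0, 0.0, 1.0])
q3 = np.array([0.0, 1.0, 1.0])

# After the first round the model is -8 * q1 (weight from the caption).
f = -8.0 * q1
g = f - y  # gradient of 0.5 * (f - y)^2 (assumed loss); -g equals (-2, 2, 5)

def perp(v, Q):
    """Component of v orthogonal to the column space of Q (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(Q, v, rcond=None)
    return v - Q @ coef

def obj_gb(q, g):
    """Assumed form of the standard gradient boosting objective."""
    return abs(g @ q) / np.linalg.norm(q)

def obj_ogb(q, g, Q, eps=1e-8):
    """Orthogonal objective: |g_perp^T q| / (||q_perp|| + eps)."""
    return abs(perp(g, Q) @ q) / (np.linalg.norm(perp(q, Q)) + eps)

Q = q1.reshape(-1, 1)  # already selected query outputs as columns
candidates = {"q2": q2, "q3": q3}

gb_scores = {name: obj_gb(q, g) for name, q in candidates.items()}
ogb_scores = {name: obj_ogb(q, g, Q) for name, q in candidates.items()}

print("standard objective:  ", gb_scores, "->", max(gb_scores, key=gb_scores.get))
print("orthogonal objective:", ogb_scores, "->", max(ogb_scores, key=ogb_scores.get))
```

Under these assumptions the standard objective scores $q_2$ at 5 and $q_3$ at roughly 4.95, so it prefers $q_2$, while the orthogonal objective scores $q_3$ at roughly 5.72 against 5 for $q_2$, which matches the selection described in the caption.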

Theorems & Definitions (15)

  • Proposition 0
  • Proposition 0
  • proof : Proof sketch
  • Proposition 0
  • proof : Proof sketch
  • Proposition 0
  • proof
  • Proposition 0
  • proof
  • Proposition 0
  • ...and 5 more