Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting

Hongyi Li, Han Lin, Jun Xu

TL;DR

The paper tackles learning oblique splits in regression trees, where finding optimal splits is NP-hard. It introduces the Hinge Regression Tree (HRT), which reframes each node split as a nonlinear least-squares problem over two linear predictors joined by a hinge (their max/min envelope), giving ReLU-like expressivity. Each split is fitted by alternating between partitioning the data and refitting the two predictors, which amounts to a damped Newton (Gauss-Newton) method within fixed partitions. The authors establish monotone descent and convergence at the node level under backtracking line search, prove universal approximation with an explicit $O(\delta^2)$ rate for the resulting piecewise-linear class, and report that HRT achieves competitive or superior accuracy with markedly shallower trees on synthetic and real-world regression tasks. Optional ridge regularization adds robustness, and the method scales in practice, making oblique single-tree models both effective and interpretable for nonlinear function approximation.

Abstract

Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT's model class is a universal approximator with an explicit $O(\delta^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
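
The node-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `hinge_objective` and `fit_hinge_node`, the initialisation (a random hyperplane through the data mean), the use of the max envelope only, and the default values of `mu`, `ridge`, and `n_iter` are all assumptions made for illustration. The sketch alternates between (1) assigning each sample to whichever linear predictor is currently active and (2) moving both predictors a fraction `mu` of the way towards their ridge least-squares fits on those fixed partitions.

```python
import numpy as np


def hinge_objective(theta, Xb, y):
    """0.5 * sum of squared residuals of the max envelope of two linear predictors."""
    pred = np.max(Xb @ theta.T, axis=1)
    return 0.5 * float(np.sum((y - pred) ** 2))


def fit_hinge_node(X, y, mu=1.0, ridge=1e-6, n_iter=30, seed=0):
    """Illustrative damped alternating fit of y ~ max(a.x + a0, b.x + b0).

    Returns (theta, history): theta has shape (2, d + 1), one row per
    predictor with the bias last; history holds the objective per iterate.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
    d = Xb.shape[1]

    def ridge_ls(mask):
        # Ridge-regularised least squares on the samples selected by mask.
        A = Xb[mask]
        return np.linalg.solve(A.T @ A + ridge * np.eye(d), A.T @ y[mask])

    # Assumed initialisation: split by a random hyperplane through the mean.
    v = rng.standard_normal(X.shape[1])
    side = (X - X.mean(axis=0)) @ v >= 0
    theta = np.stack([ridge_ls(side), ridge_ls(~side)])
    history = [hinge_objective(theta, Xb, y)]

    for _ in range(n_iter):
        # Step 1: fix the partition induced by the currently active predictor.
        scores = Xb @ theta.T
        active = scores[:, 0] >= scores[:, 1]
        # Step 2: least-squares targets on each side of the fixed partition.
        target = theta.copy()
        for k, mask in enumerate((active, ~active)):
            if mask.sum() >= d:  # keep the old predictor if a side is too small
                target[k] = ridge_ls(mask)
        # Step 3: damped move towards the targets (mu = 1 is the full step).
        theta = theta + mu * (target - theta)
        history.append(hinge_objective(theta, Xb, y))
    return theta, history
```

Calling `fit_hinge_node(X, y, mu=0.1)` with a smaller `mu` trades speed for smaller, more cautious updates; the sketch uses a fixed `mu` and does not implement the backtracking line search that the convergence result refers to.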

Paper Structure

This paper contains 50 sections, 4 theorems, 69 equations, 4 figures, 15 tables, 4 algorithms.

Key Result

Theorem 1

Let $\mathcal{F}$ be the class of piecewise linear functions represented by finite oblique regression trees with linear models at the leaves. Let $g: \mathcal{K} \to \mathbb{R}$ be a twice continuously differentiable function ($g \in C^2(\mathcal{K})$) on a compact set $\mathcal{K} \subset \mathbb{R}^{\cdots}$ […] This approximation rate directly implies the universal approximation property, i.e., for any $\epsilon$ […]
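
For intuition about where a quadratic rate can come from, the classical one-dimensional bound for linear interpolation is a useful reference point. This is a textbook fact, not the paper's proof, and reading $\delta$ as the width of a single leaf cell is an assumption here, since the theorem statement above is cut off. If $g \in C^2([a, a+\delta])$ and $\ell$ is the line through $(a, g(a))$ and $(a+\delta, g(a+\delta))$, then

$$
\max_{x \in [a,\, a+\delta]} \lvert g(x) - \ell(x) \rvert \;\le\; \frac{\delta^2}{8} \max_{\xi \in [a,\, a+\delta]} \lvert g''(\xi) \rvert .
$$

Halving the cell width therefore cuts this error bound by a factor of four, which is the behaviour an $O(\delta^2)$ rate describes.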

Figures (4)

  • Figure 1: Node-level convergence analysis on the unstable sinc function. Left: Objective value per iteration for a single internal node (fixed initialization and data subset). The unit step ($\mu=1.0$, blue) and a large step ($\mu=0.5$, orange) do not decrease the objective monotonically in this example, and the latter gets trapped in a limit cycle. Smaller damping ($\mu=0.1$, green; $\mu=0.05$, red) yields much more regular local Newton dynamics at this node. Right: Final fitted models for this controlled experiment. With large steps the node effectively collapses to a poor single linear fit, whereas sufficiently damped updates recover a meaningful piecewise linear approximation. Note that this figure illustrates local node-level behaviour; full-tree performance with fallback is analysed in Appendix \ref{apendix_fallback}.
  • Figure 2: Node-level convergence analysis on the well-behaved twisted_sigmoid function. Left: For this node, all step sizes lead to monotone decrease of the objective. The unit Newton step ($\mu=1.0$, blue) reaches the local minimum in the fewest iterations. Right: All step sizes arrive at essentially the same high-quality piecewise linear fit around the function's inflection point. This illustrates that, on stable problems, even aggressive Newton steps can behave well at the node level. Global training behaviour across all nodes and datasets is reported in Tables \ref{tab:ablation_sinc}--\ref{tab:ablation_delta_ailerons}.
  • Figure 3: Top: Approximation performance of various methods on sinc and twisted_sigmoid functions. Bottom: Residuals $r = y_{\text{pred}} - y_{\text{true}}$, representing the difference between predicted and true values for each method.
  • Figure 4: 3D function approximation. We visualize the learned piecewise linear surface of our method along with training/testing points. Contours on the floor help the reader judge depth. For clarity, only the fitted surface of our method is shown; detailed quantitative comparisons with baselines are provided in Table \ref{tab:synthetic_performance_combined}.

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2: Node-level convergence of the line-search Newton update
  • proof
  • Theorem 1
  • proof
  • Corollary 1: Curse of Dimensionality