Boost Like a (Var)Pro: Trust-Region Gradient Boosting via Variable Projection

Abhijit Chowdhary, Elizabeth Newman, Deepanshu Verma

Abstract

Gradient boosting, a method of building additive ensembles from weak learners, has established itself as a practical and theoretically motivated approach to approximating functions, especially with decision tree weak learners. Comparable methods for smooth parametric learners, such as neural networks, remain less developed in both training methodology and theory. To this end, we introduce \texttt{VPBoost} ({\bf V}ariable {\bf P}rojection {\bf Boost}ing), a gradient boosting algorithm for separable smooth approximators, i.e., models with a smooth nonlinear featurizer followed by a final linear mapping. \texttt{VPBoost} fuses variable projection, a training paradigm for separable models that enforces optimality of the linear weights, with a second-order weak learning strategy. The combination of second-order boosting, separable models, and variable projection gives rise to a closed-form solution for the optimal linear weights and a natural interpretation of \texttt{VPBoost} as a functional trust-region method. We thereby leverage trust-region theory to prove that \texttt{VPBoost} converges to a stationary point under mild geometric conditions and, under stronger assumptions, achieves a superlinear convergence rate. Comprehensive numerical experiments on synthetic data, image recognition, and scientific machine learning benchmarks demonstrate that \texttt{VPBoost} learns an ensemble with improved evaluation metrics in comparison to gradient-descent-based boosting and attains competitive performance relative to an industry-standard decision tree boosting algorithm.
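To make the variable-projection idea in the abstract concrete, the following is a minimal sketch on a toy separable model; it is not the authors' VPBoost algorithm. Everything specific here is an assumption chosen for illustration: the two-feature featurizer $A_{\theta}(x) = [1, e^{-\theta x}]$, the synthetic noise-free data, the small ridge term, and the outer solver (a coarse grid followed by golden-section search, swapped in for the second-order trust-region step the paper describes). The point it illustrates is that the linear weights are eliminated in closed form for each $\theta$, so the outer search runs only over the nonlinear parameter.

```python
import math

def features(theta, x):
    """Hypothetical featurizer A_theta(x) = [1, exp(-theta * x)] (an assumption)."""
    return (1.0, math.exp(-theta * x))

def optimal_weights(theta, xs, ys, ridge=1e-8):
    """Closed-form ridge least-squares weights w_star(theta).

    Solves the 2x2 normal equations (A^T A + ridge * I) w = A^T y by Cramer's rule.
    """
    g00 = g01 = g11 = b0 = b1 = 0.0
    for x, y in zip(xs, ys):
        a0, a1 = features(theta, x)
        g00 += a0 * a0
        g01 += a0 * a1
        g11 += a1 * a1
        b0 += a0 * y
        b1 += a1 * y
    g00 += ridge
    g11 += ridge
    det = g00 * g11 - g01 * g01
    return ((g11 * b0 - g01 * b1) / det, (g00 * b1 - g01 * b0) / det)

def reduced_loss(theta, xs, ys):
    """Least-squares loss with the linear weights projected out: L(w_star(theta), theta)."""
    w0, w1 = optimal_weights(theta, xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        a0, a1 = features(theta, x)
        total += (w0 * a0 + w1 * a1 - y) ** 2
    return 0.5 * total

def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a function that is unimodal on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def fit_varpro(xs, ys, theta_min=0.05, theta_max=3.0, step=0.05):
    """Outer search over theta only; the weights follow from optimal_weights."""
    n_grid = int(round((theta_max - theta_min) / step)) + 1
    grid = [theta_min + i * step for i in range(n_grid)]
    best = min(grid, key=lambda t: reduced_loss(t, xs, ys))
    # Refine inside the bracket around the best grid point.
    lo, hi = max(theta_min, best - step), min(theta_max, best + step)
    theta = golden_section(lambda t: reduced_loss(t, xs, ys), lo, hi)
    return theta, optimal_weights(theta, xs, ys)

# Synthetic target y = 2 - 1.5 * exp(-0.7 * x), so the generating parameters are known.
xs = [0.1 * i for i in range(30)]
ys = [2.0 - 1.5 * math.exp(-0.7 * x) for x in xs]
theta_hat, w_hat = fit_varpro(xs, ys)
print(theta_hat, w_hat)
```

Because the weights are always optimal for the current $\theta$, the iterates stay on the curve $\mathbf{w}_{\star}(\boldsymbol{\theta})$; a gradient-based or trust-region outer solver could replace the one-dimensional search used here.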

Paper Structure (53 sections, 21 theorems, 110 equations, 14 figures, 3 tables, 2 algorithms)

Key Result

Lemma 2

Given Bochner space $\mathcal{B}$ with induced operator norm $\|A_{\boldsymbol{\theta}}\|_{\mathcal{B}} \coloneqq \sup_{\|\mathbf{w}\|_2=1} \|A_{\boldsymbol{\theta}}(\cdot)\mathbf{w}\|_{\mathcal{F}}$,

Figures (14)

  • Figure 1: Weak learners progressively capture higher frequencies.
  • Figure 2: Trajectories of gradient descent (GD), alternating directions (AD), and variable projection (VP). Geometrically, VP iterates traverse the curve $\mathbf{w}_{\star}(\boldsymbol{\theta})$.
  • Figure 3: Reduction ratio cutoffs and regularization parameter update in Algorithm \ref{alg:trust_region_skeleton_vpboost}.
  • Figure 4: Illustration of the VPBoost bridge between function space $\mathcal{F}$ (left) and parameter space $\mathbb{R}^{n_{w}} \times \mathbb{R}^{n_{\theta}}$ (right). The dashed region is the subset of $\mathcal{F}$ containing separable weak learners with a specific featurizer architecture, $A_{\boldsymbol{\theta}}$. Each colorful ellipse represents a linear subspace of $\mathcal{F}$ for fixed $\boldsymbol{\theta} \in \mathbb{R}^{n_{\theta}}$, defined as $\mathcal{F}_{\rm sep}(\boldsymbol{\theta}) = \{A_{\boldsymbol{\theta}}(\cdot) \mathbf{w} \in \mathcal{F} \mid \mathbf{w}\in \mathbb{R}^{n_{w}}\}$. As VarPro traverses along $\mathbf{w}_{\star}(\boldsymbol{\theta})$ in parameter space and progresses toward a minimum (colorful circles along the yellow curve), the corresponding linear subspaces $\mathcal{F}_{\rm sep}(\boldsymbol{\theta})$ evolve simultaneously in function space and progress toward the target, $f^* - f^{(m)}$.
  • Figure 5: Fully-connected NN with two hidden layers. Thicker blue arrows indicate the final linear map.
  • ...and 9 more figures

Theorems & Definitions (33)

  • Lemma 2
  • Lemma 3: Gradient of Reduced Objective Function $\Reduced{\SAA\ObjFctn}$
  • Remark 4
  • Remark 5
  • Remark 6
  • Lemma 7: VarPro Guarantees Descent
  • Remark 9
  • Definition 10: Cauchy Point
  • Lemma 11: VarPro vs. Cauchy Model Reduction
  • Lemma 12: VarPro Model Reduction Lower Bound
  • ...and 23 more