Boost Like a (Var)Pro: Trust-Region Gradient Boosting via Variable Projection

Abhijit Chowdhary; Elizabeth Newman; Deepanshu Verma

Abstract

Gradient boosting, a method of building additive ensembles from weak learners, has established itself as a practical and theoretically motivated approach to approximate functions, especially using decision tree weak learners. Comparable methods for smooth parametric learners, such as neural networks, remain less developed in both training methodology and theory. To this end, we introduce \texttt{VPBoost} ({\bf V}ariable {\bf P}rojection {\bf Boost}ing), a gradient boosting algorithm for separable smooth approximators, i.e., models with a smooth nonlinear featurizer followed by a final linear mapping. \texttt{VPBoost} fuses variable projection, a training paradigm for separable models that enforces optimality of the linear weights, with a second-order weak learning strategy. The combination of second-order boosting, separable models, and variable projection gives rise to a closed-form solution for the optimal linear weights and a natural interpretation of \texttt{VPBoost} as a functional trust-region method. We thereby leverage trust-region theory to prove that \texttt{VPBoost} converges to a stationary point under mild geometric conditions and, under stronger assumptions, achieves a superlinear convergence rate. Comprehensive numerical experiments on synthetic data, image recognition, and scientific machine learning benchmarks demonstrate that \texttt{VPBoost} learns an ensemble with improved evaluation metrics in comparison to gradient-descent-based boosting and attains competitive performance relative to an industry-standard decision tree boosting algorithm.
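To make the idea of "enforcing optimality of the linear weights" concrete, the following is a minimal sketch of variable projection for a separable model $A_{\boldsymbol{\theta}}(x)\mathbf{w}$ under a ridge-regularized least-squares loss. It is not the paper's algorithm: the `tanh` featurizer, the penalty `lam`, the toy data, and all function names are assumptions chosen for illustration. For fixed $\boldsymbol{\theta}$, the weights $\mathbf{w}_{\star}(\boldsymbol{\theta})$ solve the normal equations $(A^\top A + \lambda I)\mathbf{w} = A^\top \mathbf{y}$ in closed form.

```python
import math
import random

def featurize(theta, xs):
    """Hypothetical featurizer A_theta: row i holds tanh(a*x_i + b) for each (a, b) in theta."""
    return [[math.tanh(a * x + b) for (a, b) in theta] for x in xs]

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def optimal_weights(theta, xs, ys, lam=1e-3):
    """Closed-form w_star(theta): solve (A^T A + lam*I) w = A^T y (lam is an assumed ridge penalty)."""
    A = featurize(theta, xs)
    n_w = len(theta)
    gram = [[sum(row[j] * row[k] for row in A) + (lam if j == k else 0.0)
             for k in range(n_w)] for j in range(n_w)]
    rhs = [sum(row[j] * y for row, y in zip(A, ys)) for j in range(n_w)]
    return solve(gram, rhs)

def objective(theta, w, xs, ys, lam=1e-3):
    """Full objective: 0.5*||A_theta w - y||^2 + 0.5*lam*||w||^2."""
    A = featurize(theta, xs)
    fit = sum((sum(a * wj for a, wj in zip(row, w)) - y) ** 2
              for row, y in zip(A, ys))
    return 0.5 * fit + 0.5 * lam * sum(wj * wj for wj in w)

def reduced_objective(theta, xs, ys, lam=1e-3):
    """Reduced objective: the full objective evaluated at w_star(theta), a function of theta only."""
    return objective(theta, optimal_weights(theta, xs, ys, lam), xs, ys, lam)

# Toy data (assumed): samples of sin(2x) on [-2, 2].
random.seed(0)
xs = [i / 10.0 for i in range(-20, 21)]
ys = [math.sin(2.0 * x) for x in xs]
theta = [(1.0, 0.0), (2.0, 0.5), (0.5, -1.0)]

w_star = optimal_weights(theta, xs, ys)
w_rand = [random.uniform(-1.0, 1.0) for _ in theta]
w_pert = [wj + 0.1 for wj in w_star]
```

In this sketch, an outer optimizer would update only `theta` using `reduced_objective`, while the weights are re-solved at every step. For any fixed `theta`, no other choice of weights (such as `w_rand` or `w_pert`) can achieve a lower objective value than `w_star`.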

Paper Structure

This paper contains 53 sections, 21 theorems, 110 equations, 14 figures, 3 tables, and 2 algorithms.

Introduction
Literature Review
Second-Order Boosting
Boosting with Optimal Linear Weights
Theoretical Guarantees for Boosting
Background
Gradient Boosting 101
Separable Weak Learners
Optimizing Separable Weak Learners with Variable Projection (VarPro)
A Note on Notation
VarPro Gradient Boosting (VPBoost)
Training Weak Learners with VarPro
VPBoost: A Functional Trust-Region Perspective
VPBoost Convergence Analysis
VPBoost Guarantees Descent in Function Space
...and 38 more sections

Key Result

Lemma 2

Given Bochner space $\mathcal{B}$ with induced operator norm $\|A_{\boldsymbol{\theta}}\|_{\mathcal{B}} \coloneqq \sup_{\|\mathbf{w}\|_2=1} \|A_{\boldsymbol{\theta}}(\cdot)\mathbf{w}\|_{\mathcal{F}}$,

Figures (14)

Figure 1: Weak learners progressively capture higher frequencies.
Figure 2: Trajectories of gradient descent (GD), alternating directions (AD), and variable projection (VP). Geometrically, VP iterates traverse the curve $\mathbf{w}_{\star}(\boldsymbol{\theta})$.
Figure 3: Reduction ratio cutoffs and regularization parameter update in Algorithm \ref{alg:trust_region_skeleton_vpboost}.
Figure 4: Illustration of the VPBoost bridge between function space $\mathcal{F}$ (left) and parameter space $\mathbb{R}^{n_{w}} \times \mathbb{R}^{n_{\theta}}$ (right). The dashed region is the subset of $\mathcal{F}$ containing separable weak learners with a specific featurizer architecture, $A_{\boldsymbol{\theta}}$. Each colorful ellipse represents a linear subspace of $\mathcal{F}$ for fixed $\boldsymbol{\theta} \in \mathbb{R}^{n_{\theta}}$, defined as $\mathcal{F}_{\rm sep}(\boldsymbol{\theta}) = \{A_{\boldsymbol{\theta}}(\cdot) \mathbf{w} \in \mathcal{F} \mid \mathbf{w}\in \mathbb{R}^{n_{w}}\}$. As VarPro traverses $\mathbf{w}_{\star}(\boldsymbol{\theta})$ in parameter space and progresses toward a minimum (colorful circles along the yellow curve), the corresponding linear subspaces $\mathcal{F}_{\rm sep}(\boldsymbol{\theta})$ evolve simultaneously in function space and progress toward the target, $f^* - f^{(m)}$.
Figure 5: Fully-connected NN with two hidden layers. Thicker blue arrows indicate the final linear map.
...and 9 more figures

Theorems & Definitions (33)

Lemma 2
Lemma 3: Gradient of Reduced Objective Function $\Reduced{\SAA\ObjFctn}$
Remark 4
Remark 5
Remark 6
Lemma 7: VarPro Guarantees Descent
Remark 9
Definition 10: Cauchy Point
Lemma 11: VarPro vs. Cauchy Model Reduction
Lemma 12: VarPro Model Reduction Lower Bound
...and 23 more
