Simple Linear Neuron Boosting
Daniel Munoz
TL;DR
The paper introduces Linear Neuron Boosting (LNB), a function-space optimization technique that preconditions gradient updates for linear neurons by whitening their input features. By formulating the per-neuron metric $M_i = \mathbb{E}_{x_{i-1}\sim X_i}[ (\partial x_i / \partial \theta_i)^T (\partial x_i / \partial \theta_i) ]$ and solving $M_i \hat{\theta}_i = g_i$, LNB achieves a preconditioned, matrix-free update $\hat{\theta}_B$ that is equivalent to a whitening reparameterization and can be implemented with autodifferentiation across architectures. The method supports online learning via EMA estimates and limited Conjugate Gradient iterations, with a straightforward interpretation as feature whitening preceding linear transforms. Empirical results across matrix factorization, MLPs, Vision Transformers, and UNet demonstrate faster convergence in epochs and competitive wall-clock time relative to Adam, highlighting LNB as a practical, architecture-agnostic preconditioning technique. The work situates LNB among function-space optimizations and connects it to whitening and second-order methods, offering a scalable, easy-to-implement alternative for improving training dynamics in deep networks.
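To make the solve $M_i \hat{\theta}_i = g_i$ concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single scalar-output linear neuron $x_i = \theta_i^T x_{i-1}$, for which $\partial x_i / \partial \theta_i = x_{i-1}^T$ and the metric reduces to the feature second-moment matrix $\mathbb{E}[x_{i-1} x_{i-1}^T]$. The names `metric_vector_product` and `conjugate_gradient`, the `damping` term, the batch estimate of the expectation, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def metric_vector_product(features, v, damping=1e-6):
    """Return (M + damping * I) @ v with M = mean(x x^T), without forming M.

    `damping` is an assumed stabilizer, not something taken from the paper.
    """
    n = features.shape[0]
    return features.T @ (features @ v) / n + damping * v

def conjugate_gradient(matvec, b, num_iters=10, tol=1e-16):
    """Approximately solve A x = b for symmetric positive-definite A,
    using only matrix-vector products. `tol` bounds the squared residual norm."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0
    p = r.copy()          # search direction
    rs_old = r @ r
    for _ in range(num_iters):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Synthetic, correlated input features for one neuron (hypothetical data).
rng = np.random.default_rng(0)
z = rng.normal(size=(512, 8))
features = z * np.linspace(1.0, 3.0, 8) + 0.5 * z[:, :1]

g = rng.normal(size=8)    # stand-in for the backpropagated gradient g_i

# A generous iteration cap so this toy problem converges; an online
# setting would typically cap the iterations at a small number.
theta_hat = conjugate_gradient(
    lambda v: metric_vector_product(features, v), g, num_iters=50
)
```

The solver touches the features only through products of the form $X^T (X v)$, which is what makes the update matrix-free: the $d \times d$ metric is never stored or inverted.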
Abstract
Given a differentiable network architecture and loss function, we revisit optimizing the network's neurons in function space using Boosted Backpropagation (Grubb & Bagnell, 2010), in contrast to optimizing in parameter space. From this perspective, we reduce descent in the space of linear functions that optimizes the network's backpropagated errors to a preconditioned gradient descent algorithm. We show that this preconditioned update rule is equivalent to reparameterizing the network to whiten each neuron's features, with the benefit that the normalization occurs outside of inference. In practice, we use this equivalence to construct an online estimator for approximating the preconditioner, and we propose an online, matrix-free learning algorithm with adaptive step sizes. The algorithm is applicable whenever autodifferentiation is available, including for convolutional networks and transformers, and it is simple to implement in both local and distributed training settings. We demonstrate fast convergence in terms of both epochs and wall-clock time on a variety of tasks and networks.
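The stated equivalence between the preconditioned update and feature whitening can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions rather than the paper's estimator: it assumes a scalar-output linear neuron, a batch estimate of the metric $M$, and a symmetric whitening matrix $W = M^{-1/2}$. Under the reparameterization $\theta = W\phi$, a plain gradient step in $\phi$ maps back to $W W g = M^{-1} g$ in $\theta$. All variable names and the synthetic data are hypothetical.

```python
import numpy as np

# Synthetic, correlated features for one neuron (hypothetical data).
rng = np.random.default_rng(1)
z = rng.normal(size=(1024, 4))
x = z * np.array([1.0, 2.0, 0.5, 3.0]) + 0.3 * z[:, :1]

# Batch estimate of the metric M = E[x x^T].
M = x.T @ x / x.shape[0]

# Symmetric whitening matrix W = M^{-1/2} via eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(M)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Whitened features u = W x have (approximately) identity second moment.
u = x @ W                               # W is symmetric, so rows are (W x)^T
second_moment_whitened = u.T @ u / u.shape[0]

g = rng.normal(size=4)                  # stand-in gradient with respect to theta

# Reparameterize theta = W @ phi. By the chain rule, grad_phi = W^T g.
grad_phi = W.T @ g
# A unit gradient step in phi, mapped back to theta-space through W.
step_from_whitening = W @ grad_phi

# The preconditioned step obtained by solving M theta_hat = g directly.
step_preconditioned = np.linalg.solve(M, g)
```

Both routes produce the same direction in $\theta$, but the preconditioned form leaves the forward pass untouched, which matches the abstract's remark that the normalization occurs outside of inference.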
