Condensed Gradient Boosting
Seyedsaman Emami, Gonzalo Martínez-Muñoz
TL;DR
The paper addresses the computational burden of standard gradient boosting in multi-class classification and multi-output regression, where one tree per class per iteration is typically trained. It introduces Condensed Gradient Boosting (C-GB), which uses a single multi-output tree with vector-valued leaves and a two-step optimization: first fitting the base learner to pseudo-residuals via least-squares, then applying a Newton-Raphson refinement to update leaf outputs. Through extensive experiments on 12 multi-class and 3 multi-output regression datasets, C-GB shows comparable or improved generalization relative to standard GB while reducing training and prediction times, and it often outperforms competing multi-output approaches like TFBT and GBDT-MO. The results suggest substantial reductions in ensemble complexity with preserved accuracy, highlighting C-GB’s practical appeal for large-scale, multi-target problems, and the work provides an open-source implementation for broader use and extension. The model update at iteration $m$ is $\hat{\mathbf{F}}_m(\mathbf{x}) = \hat{\mathbf{F}}_{m-1}(\mathbf{x}) + \nu \tilde{\mathbf{h}}_m(\mathbf{x})$, with $\tilde{\mathbf{h}}_m(\mathbf{x}) = \{\gamma_{k,m} h_{k,m}(\mathbf{x})\}_{k=1}^K$ in the multi-output setting.
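The two-step procedure can be illustrated with a minimal Python sketch for the multi-class case. It is not the authors' implementation: the function names `fit_cgb_sketch` and `predict_cgb_sketch` are hypothetical, scikit-learn's `DecisionTreeRegressor` stands in for the multi-output base learner, and the leaf refinement assumes the Friedman-style one-step Newton approximation applied per class, which may differ in detail from the formula used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def softmax(F):
    """Row-wise softmax with a max-shift for numerical stability."""
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def fit_cgb_sketch(X, y, n_classes, n_rounds=20, lr=0.1, max_depth=3):
    """Hypothetical helper: trains ONE multi-output tree per boosting round."""
    Y = np.eye(n_classes)[y]                # one-hot targets, shape (n, K)
    F = np.zeros((X.shape[0], n_classes))   # raw scores, initialised at zero
    rounds = []
    for _ in range(n_rounds):
        R = Y - softmax(F)                  # pseudo-residuals of the cross-entropy loss
        # Step 1: least-squares fit of a single tree to all K residual columns.
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, R)
        leaf_of = tree.apply(X)             # leaf index of every training row
        # Step 2: overwrite each leaf with a vector-valued Newton step
        # (assumed Friedman-style formula, computed independently per class).
        leaf_values = np.zeros((tree.tree_.node_count, n_classes))
        for leaf in np.unique(leaf_of):
            r = R[leaf_of == leaf]
            num = r.sum(axis=0)
            den = (np.abs(r) * (1.0 - np.abs(r))).sum(axis=0)
            safe = den > 1e-12              # leave the update at zero where den vanishes
            leaf_values[leaf, safe] = (n_classes - 1) / n_classes * num[safe] / den[safe]
        F += lr * leaf_values[leaf_of]      # F_m = F_{m-1} + nu * h_m
        rounds.append((tree, leaf_values))
    return rounds, lr


def predict_cgb_sketch(model, X):
    """Sum the shrunken leaf vectors over all rounds and take the argmax."""
    rounds, lr = model
    F = sum(lr * values[tree.apply(X)] for tree, values in rounds)
    return F.argmax(axis=1)
```

With $K$ classes and $M$ rounds, this sketch stores $M$ trees, whereas a one-tree-per-class scheme would store $K \cdot M$; that difference is the source of the reduced ensemble size described above.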
Abstract
This paper presents a computationally efficient variant of gradient boosting for multi-class classification and multi-output regression tasks. Standard gradient boosting uses a 1-vs-all strategy for classification tasks with more than two classes. Under this strategy, one tree per class has to be trained at every iteration. In this work, we propose the use of multi-output regressors as base models to handle the multi-class problem as a single task. In addition, the proposed modification allows the model to learn multi-output regression problems. An extensive comparison with other multi-output based gradient boosting methods is carried out in terms of generalization and computational efficiency. The proposed method showed the best trade-off between generalization ability and training and prediction speeds.
