Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods

Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal

TL;DR

This work provides the first large-scale empirical comparison of six LP-based totally corrective boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, against leading heuristic baselines across 20 datasets. It shows that totally corrective methods can match or outperform state-of-the-art heuristics when using shallow trees while producing significantly sparser ensembles, and that they can thin pre-trained ensembles without loss of performance. The study analyzes not only accuracy but also margin distributions, anytime behavior, hyperparameter sensitivity, and reweighting dynamics, offering practical guidance for interpretable, efficient ensemble design. It also shows that optimal decision trees offer dataset-dependent gains with no consistent sparsity advantage, highlighting the trade-offs between base-learner strength, diversity, and ensemble sparsity.

Abstract

Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.
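To make the idea of totally corrective reweighting concrete, the sketch below solves the classical soft-margin LPBoost linear program over a fixed pool of base learners. It is an illustration only, not code from the paper and not necessarily identical to any of its six formulations. The function name `lp_reweight`, the prediction matrix `H`, and the trade-off parameter `nu` are assumptions introduced here. All weights are re-optimized jointly, and learners that receive zero weight can be dropped, which is the mechanism behind sparse or thinned ensembles.

```python
import numpy as np
from scipy.optimize import linprog


def lp_reweight(H, y, nu=0.5):
    """Soft-margin LPBoost weights for a fixed pool of base learners.

    H  : (n, m) array, H[i, j] = prediction in {-1, +1} of learner j on sample i
    y  : (n,) array of labels in {-1, +1}
    nu : hypothetical trade-off in (0, 1); slack penalty is D = 1 / (n * nu)

    Solves: max rho - D * sum(xi)
            s.t. y_i * sum_j w_j H[i, j] + xi_i >= rho,  sum(w) = 1,  w, xi >= 0
    """
    n, m = H.shape
    D = 1.0 / (n * nu)
    # Variable vector z = [w (m entries), xi (n entries), rho (1 entry)].
    c = np.concatenate([np.zeros(m), np.full(n, D), [-1.0]])  # minimize -objective
    # Margin constraints rewritten as: -(y_i H_i) w - xi_i + rho <= 0.
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(n), np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Weights form a convex combination.
    A_eq = np.concatenate([np.ones(m), np.zeros(n), [0.0]])[None, :]
    b_eq = [1.0]
    bounds = [(0, None)] * (m + n) + [(None, None)]  # rho is unconstrained in sign
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[-1]


# Toy pool: learner 0 is always right, learner 1 always predicts +1,
# learner 2 is always wrong.
y = np.array([1, -1, 1, -1, 1, -1], dtype=float)
H = np.column_stack([y, np.ones_like(y), -y])
w, rho = lp_reweight(H, y)
kept = np.flatnonzero(w > 1e-8)  # learners that survive thinning
```

On this toy pool the LP places all weight on the perfect learner, so only one of the three learners is kept. In a boosting loop, a solve of this kind would be repeated each time a new base learner is added to the pool.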

Paper Structure

This paper contains 32 sections, 15 equations, 17 figures, 20 tables, 1 algorithm.

Figures (17)

  • Figure 1: Average testing accuracy compared to average ensemble sparsity over all datasets for CART decision trees of depth 1, 3, 5, and 10.
  • Figure 2: Anytime behavior on the image dataset for selected methods (all other methods in gray), for CART decision trees of depth 1, 3, 5, and 10. Error bars indicate the standard deviation over 5 seeds.
  • Figure 3: Anytime behavior on the ringnorm dataset for selected methods (all other methods in gray), for CART decision trees of depth 1, 3, 5, and 10. Error bars indicate the standard deviation over 5 seeds.
  • Figure 4: Ensemble weights for four datasets (top to bottom: breast cancer, german credit, splice, and twonorm) and depth 1 CART trees for totally corrective methods and AdaBoost, over 5 seeds. Bold highlights the best overall accuracy, while a star$^*$ marks the best among totally corrective methods.
  • Figure 5: Test data margin distribution for german credit (left column plots) and ringnorm (right column plots) datasets using CART trees of depths 1, 3, and 5, over 5 seeds. Bold highlights the best overall accuracy, while a star$^*$ marks the best among totally corrective methods.
  • ...and 12 more figures