Tricks and Plug-ins for Gradient Boosting with Transformers
Biyi Fang, Truong Vo, Jean Utke, Diego Klabjan
TL;DR
This work addresses the high resource cost and tuning burden of transformer models by introducing BoostTransformer, a boosting-based framework that pairs transformers with additive weak learners trained on token-subset representations. The core idea combines a least-squares objective with an ensemble $f(x)=\sum_{t=1}^N \alpha_t g_t(x)$, where each $g_t$ is a weak learner and $\alpha_t$ is its weight, and adds two variants: Subsequence BoostTransformer, which uses attention-guided token pruning, and Importance-sampling BoostTransformer, which samples data according to boosting weights to improve efficiency and generalization. The authors show theoretically that the optimal sampling distribution is proportional to the boosting weights or residual norms, and they report empirical gains on the IMDB, Yelp, and Amazon datasets along with notable reductions in training time. Overall, the approach delivers faster convergence, greater robustness, and reduced architectural search overhead, making transformer-based NLP more scalable for fine-grained tasks under resource constraints.
Abstract
Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least-squares boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer converges faster and reaches higher accuracy than standard transformers while minimizing architectural search overhead.
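The combination of a least-squares boosting objective, subset-based weak learners, and residual-proportional sampling can be illustrated with a small toy sketch. This is not the authors' implementation: the weak learner here is a plain linear least-squares fit on a random feature subset (a hypothetical stand-in for a transformer trained on a token subset), the data are synthetic, and the names `fit_weak_learner`, `boost`, `n_feats`, and `sample_frac` are assumptions made for illustration. The sketch only shows the additive update $f \leftarrow f + \alpha_t g_t$ with the closed-form least-squares step size and sampling probabilities proportional to the absolute residuals.

```python
import numpy as np

def fit_weak_learner(X, r, feat_idx):
    """Fit residuals r by linear least squares on a feature subset.

    Hypothetical stand-in for a small transformer trained on a token subset.
    """
    A = np.c_[X[:, feat_idx], np.ones(len(X))]  # selected features plus bias
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    return lambda Z: np.c_[Z[:, feat_idx], np.ones(len(Z))] @ w

def boost(X, y, n_rounds=20, n_feats=3, sample_frac=0.5, seed=0):
    """Least-squares boosting with residual-proportional sampling (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    learners, alphas, history = [], [], []
    f = np.zeros(n)  # current ensemble prediction on the training set
    for _ in range(n_rounds):
        r = y - f  # residuals: negative gradient of 0.5 * (y - f)^2
        # Assumed sampling rule: probability proportional to |residual|.
        p = np.abs(r) + 1e-12
        p /= p.sum()
        idx = rng.choice(n, size=int(sample_frac * n), replace=True, p=p)
        # Random feature subset, loosely analogous to selecting a token subset.
        feat_idx = rng.choice(d, size=n_feats, replace=False)
        g = fit_weak_learner(X[idx], r[idx], feat_idx)
        gx = g(X)
        # Closed-form step size minimising ||r - alpha * g(X)||^2 over alpha.
        denom = gx @ gx
        alpha = (r @ gx) / denom if denom > 0 else 0.0
        f = f + alpha * gx
        learners.append(g)
        alphas.append(alpha)
        history.append(np.mean((y - f) ** 2))  # training MSE after this round
    predict = lambda Z: sum(a * g(Z) for a, g in zip(alphas, learners))
    return predict, history

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)
predict, history = boost(X, y)
print(f"training MSE: first round {history[0]:.4f}, last round {history[-1]:.4f}")
```

Because $\alpha_t$ is chosen by exact line search on the full training set, the training error cannot increase from one round to the next, even though each weak learner sees only a sampled subset of examples and features. The sketch applies no importance-weight correction when fitting on the sampled examples; whether and how the paper does so is not stated in this summary.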
