Tricks and Plug-ins for Gradient Boosting with Transformers

Biyi Fang, Truong Vo, Jean Utke, Diego Klabjan

TL;DR

This work addresses the high resource cost and tuning burden of transformer models by introducing BoostTransformer, a boosting-based framework that pairs transformers with additive weak learners trained on token-subset representations. The core idea combines a least-squares objective with an ensemble $f(x)=\sum_{t=1}^N \alpha_t g_t(x)$ and adds two variants: Subsequence BoostTransformer, which uses attention-guided token pruning, and Importance-sampling BoostTransformer, which samples data according to boosting weights to improve efficiency and generalization. The authors provide theoretical insight that the optimal sampling distribution is proportional to the boosting weights or residual norms, and they demonstrate empirical gains on the IMDB, Yelp, and Amazon datasets, with notable reductions in training time. Overall, the approach delivers faster convergence, robustness, and reduced architectural search overhead, making transformer-based NLP more scalable for fine-grained tasks under resource constraints.
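To make the additive form $f(x)=\sum_{t=1}^N \alpha_t g_t(x)$ concrete, here is a minimal sketch of generic least-squares boosting in NumPy. It is not the authors' implementation: the weak learner is a one-dimensional threshold stump standing in for a transformer trained on a token subset, and the names `fit_stump` and `ls_boost` are hypothetical. Each round fits a weak learner to the current residuals and picks the step size $\alpha_t$ by a closed-form least-squares line search.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a threshold stump to residuals r by least squares.

    Assumption: this stump is only a stand-in for a weak learner;
    in the paper the weak learners are transformers.
    """
    best = None
    for thr in np.unique(x)[:-1]:
        left = x <= thr
        lv, rv = r[left].mean(), r[~left].mean()
        err = np.sum((r - np.where(left, lv, rv)) ** 2)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    if best is None:  # all inputs identical: fall back to a constant
        c = r.mean()
        return lambda z: np.full(np.shape(z), c)
    _, thr, lv, rv = best
    return lambda z: np.where(z <= thr, lv, rv)

def ls_boost(x, y, n_rounds=20):
    """Build f(x) = sum_t alpha_t * g_t(x) by least-squares boosting."""
    learners, alphas = [], []
    f = np.zeros(len(y), dtype=float)
    for _ in range(n_rounds):
        r = y - f                      # residuals of the current ensemble
        g = fit_stump(x, r)            # weak learner fitted to the residuals
        gx = g(x)
        denom = gx @ gx
        # Step size minimizing ||r - alpha * g(x)||^2 in closed form.
        alpha = (r @ gx) / denom if denom > 0 else 0.0
        f += alpha * gx
        learners.append(g)
        alphas.append(alpha)

    def predict(z):
        return sum(a * g(z) for a, g in zip(alphas, learners))

    return predict

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2 * np.pi * x)
    predict = ls_boost(x, y, n_rounds=30)
    print("train MSE:", np.mean((predict(x) - y) ** 2))
```

Because each round solves a least-squares problem on the residuals, the training error cannot increase from one round to the next; how the weak learners are built from token subsets is specific to the paper and is not modeled here.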

Abstract

Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least-squares boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.

Paper Structure

This paper contains 8 sections, 1 theorem, 27 equations, 9 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

For the problem $\max_{P_{t}} \mathbb{E}_{P_{t}}\left[\Delta^{(t)}\right]$, the optimal distribution with which importance-sampling-based BoostTransformer selects each sample $i$ is proportional to its "boosting weight norm" (the importance sample distribution).
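As an illustration of what Theorem 1 implies in practice, the sketch below draws a mini-batch with probabilities proportional to a per-sample norm. It is a sketch under stated assumptions, not the paper's algorithm: the exact definition of the boosting weight is not reproduced in this summary, so the code takes an arbitrary per-sample vector (for example, a residual) as input, and the function name `importance_sample` is hypothetical.

```python
import numpy as np

def importance_sample(vectors, batch_size, rng):
    """Sample indices with probability proportional to each row's L2 norm.

    Assumption: `vectors` holds one vector per training sample (e.g. its
    boosting weight or residual); the paper's exact quantity may differ.
    """
    norms = np.linalg.norm(vectors, axis=1)
    n = len(norms)
    total = norms.sum()
    # Fall back to uniform sampling if every norm is zero.
    probs = norms / total if total > 0 else np.full(n, 1.0 / n)
    idx = rng.choice(n, size=batch_size, replace=True, p=probs)
    return idx, probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residuals = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
    idx, probs = importance_sample(residuals, batch_size=8, rng=rng)
    print("probabilities:", probs)
    print("sampled indices:", idx)
```

In this toy input the row norms are 5, 0, and 10, so the sampling probabilities are 1/3, 0, and 2/3, and the sample with zero norm is never drawn.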

Figures (9)

  • Figure 1: Relative Accuracy on IMDB
  • Figure 2: Improvement on IMDB
  • Figure 3: Relative Accuracy on Yelp
  • Figure 4: Improvement on Yelp
  • Figure 5: Relative Accuracy on Amazon
  • ...and 4 more figures

Theorems & Definitions (3)

  • Theorem 1
  • proof
  • proof