The Cost of Parallelizing Boosting
Xin Lyu, Hongxun Wu, Junzhao Yang
TL;DR
This work establishes fundamental limits and a constructive trade-off for parallelizing boosting. It proves a tight lower bound showing that even slight parallelization incurs an exponential blow-up in training cost: any boosting algorithm either uses Ω(1/γ^2) rounds of interaction with the weak learner or incurs an exp(d) blow-up, which refines the earlier dichotomy of Ω(1/γ) rounds or an exp(d/γ) blow-up. It also presents a Few Rounds Boosting algorithm that leverages bagging to balance the number of rounds against the total number of weak-learner calls, giving a concrete p–t trade-off in which the number of rounds shrinks by a factor of t at the cost of exp(d t^2) growth in work. Together, these results quantify the inherent cost of parallelizing boosting and provide a framework for trading parallel queries against total computation, informing both the theory and the practice of scalable boosting. The approach combines coin-problem-based lower bounds, differential-privacy-inspired composition, and bagging-inspired parallelism to yield the first rigorous, smooth trade-off between rounds and total work in boosting.
Abstract
We study the cost of parallelizing weak-to-strong boosting algorithms for learning, following the recent work of Karbasi and Larsen. Our main results are two-fold:
- First, we prove a tight lower bound, showing that even "slight" parallelization of boosting requires an exponential blow-up in the complexity of training. Specifically, let $γ$ be the weak learner's advantage over random guessing. The famous \textsc{AdaBoost} algorithm produces an accurate hypothesis by interacting with the weak learner for $\tilde{O}(1 / γ^2)$ rounds, where each round runs in polynomial time. Karbasi and Larsen showed that "significant" parallelization must incur an exponential blow-up: any boosting algorithm either interacts with the weak learner for $Ω(1 / γ)$ rounds or incurs an $\exp(d / γ)$ blow-up in the complexity of training, where $d$ is the VC dimension of the hypothesis class. We close the gap by showing that any boosting algorithm either has $Ω(1 / γ^2)$ rounds of interaction or incurs a smaller exponential blow-up of $\exp(d)$.
- Complementing our lower bound, we show that there exists a boosting algorithm that uses $\tilde{O}(1/(t γ^2))$ rounds and suffers a blow-up of only $\exp(d \cdot t^2)$. Plugging in $t = ω(1)$, this shows that the smaller blow-up in our lower bound is tight. More interestingly, this provides the first trade-off between the parallelism and the total work required for boosting.
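To get a feel for the upper-bound trade-off, the following minimal sketch tabulates the two quantities from the abstract, $1/(t γ^2)$ rounds against an $\exp(d \cdot t^2)$ blow-up, for a few values of $t$. It is only an illustration of how the bounds scale, not the paper's algorithm: all constants and polylogarithmic factors hidden by $\tilde{O}(\cdot)$ are dropped, the function name `tradeoff` and the sample values of `gamma`, `d`, and `t` are hypothetical, and the ranges $0 < γ < 1/2$ and $t \ge 1$ are assumptions made for the sketch. The blow-up is reported through its exponent $d \cdot t^2$ to avoid floating-point overflow.

```python
import math


def tradeoff(gamma: float, d: int, t: float) -> tuple[float, float]:
    """Return (rounds, log_blowup) for the stated upper bound.

    rounds     ~ 1 / (t * gamma^2)   (constants and log factors dropped)
    log_blowup = d * t^2             (natural log of the exp(d * t^2) blow-up)

    Assumptions for this sketch: 0 < gamma < 1/2 and t >= 1.
    """
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma is assumed to lie in (0, 1/2)")
    if t < 1:
        raise ValueError("t is assumed to be at least 1")
    rounds = 1.0 / (t * gamma ** 2)
    log_blowup = d * t ** 2
    return rounds, log_blowup


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show the scaling.
    gamma, d = 0.1, 5
    print(f"{'t':>4} {'rounds':>10} {'ln(blow-up)':>12}")
    for t in (1, 2, 4, 8):
        rounds, log_blowup = tradeoff(gamma, d, t)
        print(f"{t:>4} {rounds:>10.1f} {log_blowup:>12.0f}")
```

In this simplified picture, doubling $t$ halves the number of rounds while quadrupling the exponent of the blow-up, which is the shape of the trade-off between parallelism and total work described above.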
