BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
TL;DR
BEEM tackles latency-accuracy trade-offs in early exit DNNs by treating each intermediate exit as an expert and aggregating their confidence through a weighted ensemble with neighbor-consistency: exits accumulate a weighted score $S_i$ only when neighboring predictions agree, and an exit occurs when $S_i \geq \alpha$. The method uses two weighting schemes, BEEM-C (cost-based) and BEEM-A (accuracy-based), and determines exit thresholds via optimization that leverages exit error rates to outperform the final classifier, supported by a theoretical bound on BEEM's error rate. Empirical results on GLUE and COCO show consistent speedups of $1.5\times$–$2.1\times$ with accuracy near or above the final layer, especially for easier NLP tasks; BEEM-A often yields the best trade-offs. The work demonstrates that incorporating ensemble-like information from multiple exits and carefully selecting thresholds can substantially improve the practicality of early exit DNNs in real-world NLP and vision-language tasks, with scalable benefits for larger models.
Abstract
Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy of these techniques depend crucially on the criteria used to make exit decisions. We propose a new decision criterion, BEEM, in which exit classifiers are treated as experts and their confidence scores are aggregated. The confidence scores are aggregated only if neighbouring experts are consistent in their predictions as the sample passes through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits, with the aim of surpassing the performance of conventional DNN inference. Experimental results on the COCO dataset for image captioning and the GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving speed-ups by a factor of 1.5x to 2.1x. Compared to the final layer, its accuracy is comparable on the harder image captioning task and improves on the easier language tasks. The source code for this work is publicly available at https://github.com/Div290/BEEM1/tree/main
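
The exit rule described above can be illustrated with a minimal Python sketch. It is not taken from the released code: the function name `beem_style_exit`, its arguments, and the choice to reset the running score to the current exit's weighted confidence when neighbouring predictions disagree are all assumptions made for illustration. The sketch also takes the per-exit predictions and confidences as precomputed lists, whereas real early-exit inference would evaluate exits one at a time and stop at the first exit that satisfies the rule.

```python
from typing import Sequence, Tuple


def beem_style_exit(
    predictions: Sequence[int],
    confidences: Sequence[float],
    weights: Sequence[float],
    alpha: float,
) -> Tuple[int, int, float]:
    """Simulate a neighbour-consistent, weighted-confidence exit rule.

    Returns (exit_index, predicted_label, score). All names here are
    hypothetical; the reset-on-disagreement behaviour is an assumption.
    """
    if not predictions or not (
        len(predictions) == len(confidences) == len(weights)
    ):
        raise ValueError("inputs must be non-empty and of equal length")

    score = 0.0
    for i, (pred, conf, w) in enumerate(zip(predictions, confidences, weights)):
        if i > 0 and pred == predictions[i - 1]:
            # Neighbouring exits agree: accumulate the weighted confidence.
            score += w * conf
        else:
            # First exit, or disagreement with the previous exit:
            # restart the score from this exit alone (assumed behaviour).
            score = w * conf
        if score >= alpha:
            # Aggregated confidence reached the threshold: exit here.
            return i, pred, score

    # No exit reached the threshold: fall back to the final classifier.
    last = len(predictions) - 1
    return last, predictions[last], score


if __name__ == "__main__":
    # Four exits, uniform weights, threshold alpha = 1.2 (illustrative values).
    idx, label, s = beem_style_exit(
        predictions=[1, 1, 1, 0],
        confidences=[0.6, 0.7, 0.9, 0.8],
        weights=[1.0, 1.0, 1.0, 1.0],
        alpha=1.2,
    )
    print(f"exit index={idx}, label={label}, score={s:.2f}")
```

In this sketch the `weights` argument stands in for either weighting scheme (cost-based as in BEEM-C or accuracy-based as in BEEM-A), and `alpha` stands in for the threshold that the paper sets from the exits' error rates; how those values are actually computed is not reproduced here.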
