Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones

Mehrnaz Mofakhami; Reza Bayat; Ioannis Mitliagkas; Joao Monteiro; Valentina Zantedeschi

TL;DR

This work addresses deploying large models under a fixed compute budget by using Early Exiting (EE), while correcting for the miscalibration that undermines confidence-based exit decisions. It introduces Performance Control Early Exiting (PCEE), which uses a single accuracy threshold δ and reliability diagrams to decide exits based on the average accuracy of validation samples with similar confidence, along with a smoothed variant, PCEE-WS. Empirically, larger models with EE reach lower prediction error than smaller models run in full at the same compute, and PCEE/PCEE-WS give better control over the accuracy-cost trade-off for MSDNet and ViT on CIFAR-10/100 and ImageNet. The paper notes limitations under distribution shift and points to future work such as online reliability mapping and rejection mechanisms.

Abstract

Early Exiting (EE) is a promising technique for speeding up inference by adaptively allocating compute resources to data points based on their difficulty. The approach enables predictions to exit at earlier layers for simpler samples while reserving more computation for challenging ones. In this study, we first present a novel perspective on the EE approach, showing that larger models deployed with EE can achieve higher performance than smaller models while maintaining similar computational costs. As existing EE approaches rely on confidence estimation at each exit point, we further study the impact of overconfidence on the controllability of the compute-performance trade-off. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a data point's confidence but on the average accuracy of samples with similar confidence levels from a held-out validation set. In our experiments, we show that PCEE offers a simple and computationally efficient approach that provides better control over performance than standard confidence-based approaches, and allows us to scale up model sizes to yield performance gains while reducing the computational cost.
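
The exit rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fit_reliability` and `pcee_exit_layer`, the choice of 10 equal-width confidence bins, and the handling of empty bins are all assumptions made for illustration. The sketch builds one confidence-to-accuracy mapping per exit layer from held-out validation data, then exits at the first layer whose estimated accuracy reaches the threshold δ.

```python
import numpy as np

def fit_reliability(confidences, correct, n_bins=10):
    """Per-bin validation accuracy for one exit layer (a reliability diagram).

    Hypothetical helper: equal-width bins over [0, 1] are an assumption.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges only, so bin indices fall in 0 .. n_bins - 1.
    idx = np.digitize(confidences, edges[1:-1])
    acc = np.zeros(n_bins)
    for b in range(n_bins):
        mask = idx == b
        # Assumption: empty bins get accuracy 0, so they never trigger an exit.
        acc[b] = correct[mask].mean() if mask.any() else 0.0
    return edges, acc

def pcee_exit_layer(layer_confidences, diagrams, delta):
    """Index of the first layer whose estimated accuracy is at least delta.

    `diagrams` holds one (edges, acc) pair per non-final layer; the final
    layer always produces the prediction if no earlier layer exits.
    """
    last = len(layer_confidences) - 1
    for i in range(last):
        edges, acc = diagrams[i]
        b = np.digitize(layer_confidences[i], edges[1:-1])
        if acc[b] >= delta:
            return i
    return last
```

In this sketch, an overconfident early layer (high confidence, low validation accuracy in that confidence bin) does not trigger an exit, whereas a plain confidence threshold would let it.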

Paper Structure (27 sections, 1 equation, 15 figures, 9 tables, 1 algorithm)

Introduction
Contributions
Background and Setting
Early Exit Neural Networks
Calibration and Expected Calibration Error (ECE)
Benefits of Increasing Model Size Coupled with Early Exiting
Performance Control Early Exiting
Checking for Miscalibration in Early Exit Neural Networks
Performance Control Early Exiting (PCEE)
PCEE-WS
Implementation and Training details
Experiments
Baselines
Performance Control
Effect of Calibration
...and 12 more sections

Figures (15)

Figure 1: Larger models coupled with early exiting can achieve lower prediction errors for the same computational budget compared to smaller models. This plot shows prediction error (%) versus average FLOPs used for different MSDNet sizes on CIFAR-10: small (4 layers) and large (8 layers). Various exiting strategies are compared: ours (PCEE, PCEE-WS) and Oracle (exiting as soon as a layer's prediction matches that of the final layer). Each green and yellow dot corresponds to a model seed and a threshold $\delta$. Oracle is computed by averaging over 3 seeds. With any early-exiting strategy, the large model reaches lower prediction error than the full small model while using less compute.
Figure 2: Heatmap of the layers used by an Oracle EE strategy of a ViT on $64$ random samples from ImageNet-1k. The dark bars indicate the layers used for each sample and the light-colored area shows the amount of compute that can be saved without losing performance.
Figure 3: Confidence levels across different layers of a ViT with layerwise classifiers trained on ImageNet-1k, tested on the visually simple snake image shown on the plot. Red bars indicate layers that made incorrect predictions, while blue bars indicate layers that made correct predictions. Overconfident early layers trigger a (premature) exit on layer 5, the first layer surpassing the threshold of 0.75. The test accuracy for each layer is also shown.
Figure 4: Reliability Diagrams for Layers 1, 5, 8 of MSDNet-Large with 8 layers on CIFAR-100
Figure 5: Structural overview of PCEE. In a multi-layer model with exit points at each layer, the input representation $r_i$ is processed through an exit layer block $E_i$. The exit layer calculates a confidence score $c_i$ and uses a reliability diagram (confidence-to-accuracy mapping) to determine whether to exit or continue processing. If the estimated accuracy from the reliability diagram exceeds an accuracy threshold $\delta$, the model exits and outputs prediction $\mathrm{pred}_i$; otherwise, it proceeds to the next layer, passing the representation forward.
...and 10 more figures
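
The Oracle strategy referenced in Figures 1 and 2 (exit as soon as a layer's prediction matches that of the final layer) can be sketched as follows. The function name `oracle_exit_layer` and the list-of-labels input are hypothetical choices for illustration, not taken from the paper.

```python
def oracle_exit_layer(layer_predictions):
    """Index of the first layer whose prediction equals the final layer's.

    `layer_predictions` is a sequence of predicted labels, one per layer,
    for a single sample (an assumed input format).
    """
    final = layer_predictions[-1]
    for i, pred in enumerate(layer_predictions):
        if pred == final:
            return i
    # Unreachable for a non-empty input, since the final layer matches itself.
    return len(layer_predictions) - 1
```

Because it needs the final layer's prediction, the Oracle cannot be deployed as is; it serves as a reference for how much compute could be saved without changing the model's output.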
