ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference

Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, Cen Chen

TL;DR

ConsistentEE addresses the training-inference mismatch in early exiting for large language models by casting the exit decision as a reinforcement learning problem with a per-layer policy network. It introduces the notion of a Memorized Layer to quantify instance hardness and to adapt the reward, so that easy instances accelerate aggressively while hard instances prioritize accuracy. The method achieves substantial acceleration with minimal or no accuracy loss on classification tasks and yields favorable generation quality at higher speedups, outperforming several baselines across multiple backbones. The approach demonstrates strong potential for practical, efficient inference in both understanding and generation settings. The accompanying code base supports replication and adaptation to new PLMs.

Abstract

Early exiting is one of the most popular methods for efficient inference. Current early exiting methods train with the (weighted) sum of the cross-entropy losses of all internal classifiers, which forces every one of these classifiers to predict all instances correctly. During inference, however, as long as one internal classifier predicts an instance correctly, the model can accelerate without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent in training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept of the Memorized Layer to measure the hardness of an instance. We incorporate the memorized layer into the reward function design, which allows "easy" instances to focus more on acceleration and "hard" instances to focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks.
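
The gap between the two training objectives described in the abstract can be illustrated with a small numeric sketch. This is not the authors' implementation: the function names, the three-layer toy model, and the probability values are all assumptions made for illustration. It only contrasts a loss that penalizes every internal classifier with a loss that penalizes the classifier at the chosen exit layer alone.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under one classifier."""
    return -math.log(probs[label])


def weighted_sum_loss(layer_probs, label, weights=None):
    """Conventional objective: every internal classifier contributes to the loss."""
    weights = weights or [1.0] * len(layer_probs)
    total = sum(w * cross_entropy(p, label) for w, p in zip(weights, layer_probs))
    return total / sum(weights)


def exit_layer_loss(layer_probs, label, exit_layer):
    """Exit-only objective: only the classifier at the exit layer is penalized."""
    return cross_entropy(layer_probs[exit_layer], label)


# Hypothetical 3-layer model on a binary task; the true label is 1.
# Each inner list is one internal classifier's predicted distribution.
layer_probs = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
label = 1

all_layers = weighted_sum_loss(layer_probs, label)       # layer 0 is wrong and is penalized
exit_only = exit_layer_loss(layer_probs, label, exit_layer=1)  # layer 0 is ignored
```

In this toy case the first classifier misclassifies the instance, so the weighted-sum loss is dominated by a layer the instance never needs to exit from, while the exit-only loss depends solely on the layer where the instance actually leaves.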

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: The training and inference procedure of ConsistentEE, which formulates the early exiting process as a reinforcement learning problem. The policy network can take one of two actions: exit or continue. If it exits, the corresponding internal classifier is required to predict the instance correctly; otherwise, that classifier incurs no loss.
  • Figure 2: Loss values at different layers on the RTE dataset under the weighted-sum objective and the ConsistentEE objective, respectively. The dash-dotted line is the classification boundary; a loss above the boundary means misclassification. The dashed line is the classification accuracy of each layer. The darker the color, the more samples share the same loss value.
  • Figure 3: Accuracy-Speed curves on the BERT-Base model. The evaluation metric for speedup is saved layers.
  • Figure 4: Accuracies and speedup ratios of different reward functions under varied $\alpha$.
  • Figure 5: Accuracies and speedup ratios of ConsistentEE under varied $\alpha$.
  • ...and 2 more figures
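
The exit-or-continue procedure in the Figure 1 caption, together with the saved-layers metric named in the Figure 3 caption, can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's code: the function `early_exit_inference`, the 0.5 exit threshold, and the scalar toy layers, policies, and classifiers are all hypothetical stand-ins for real transformer layers and learned networks.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def early_exit_inference(x, layers, policies, classifiers, threshold=0.5):
    """Run layers in order; exit at the first layer whose policy votes to exit.

    Returns (prediction, number_of_layers_used). The final layer always exits.
    The threshold on the policy's exit probability is an assumption.
    """
    h = x
    num_layers = len(layers)
    for i in range(num_layers):
        h = layers[i](h)
        is_last = i == num_layers - 1
        if is_last or policies[i](h) >= threshold:
            return classifiers[i](h), i + 1
    raise ValueError("model must have at least one layer")


# Toy stand-ins (assumptions): the hidden state is a scalar that grows by 1 per layer,
# and the exit probability rises as the hidden state grows.
NUM_LAYERS = 4
layers = [lambda h: h + 1.0] * NUM_LAYERS
policies = [lambda h: sigmoid(h - 1.5)] * NUM_LAYERS
classifiers = [lambda h: 1 if h > 0 else 0] * NUM_LAYERS

prediction, layers_used = early_exit_inference(0.0, layers, policies, classifiers)
saved_layers = NUM_LAYERS - layers_used  # speedup measured as layers skipped
```

Here the policy's exit probability first crosses the threshold at the second layer, so the remaining layers are never computed; raising the threshold trades speed for more computation per instance.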