SimLens for Early Exit in Large Language Models: Eliciting Accurate Latent Predictions with One More Token

Ming Ma, Bowen Zheng, Zhongqiao Lin, Tianming Yang

Abstract

Intermediate-layer predictions in large language models (LLMs) are informative but hard to decode accurately, especially at early layers. Existing lens-style methods typically rely on direct linear readout, which is simple but often drifts away from the model's eventual prediction. We propose SimLens, a simple training-free decoder for single-token decision tasks that keeps only the start token and a candidate answer token ([s] and [a]) and performs one lightweight continuation through the remaining upper layers. This surprisingly small modification recovers much more accurate latent predictions than direct linear decoding. We further introduce Linear SimLens, a lightweight linear approximation for entropy-based confidence estimation, and combine the two in SimExit, a hybrid early-exit mechanism. On ARC, BoolQ, and HeadQA with LLaMA-7B and Vicuna-7B, SimLens improves Iso-Compute accuracy in all six settings, with an average gain of +0.43 even when fair compute includes the extra two-token post-forward overhead. SimExit yields an average 1.15$\times$ speedup at the best-accuracy operating points and 1.40$\times$ when allowing up to a 1 percentage-point accuracy drop. Ablations show that [s] and [a] play distinct roles as global condition and semantic anchor, respectively.
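
To make the "extra two-token post-forward overhead" concrete, the following is a minimal sketch of one possible compute accounting, not the paper's Iso-Compute definition. It assumes cost is counted in token-layer units (one token passing through one layer), ignores the quadratic attention term and the output head, and uses the hypothetical names `full_cost`, `simlens_cost`, and `speedup`.

```python
def full_cost(n_tokens: int, n_layers: int) -> int:
    """Token-layer units for an ordinary full forward pass (assumed cost model)."""
    return n_tokens * n_layers


def simlens_cost(n_tokens: int, n_layers: int, exit_layer: int,
                 kept_tokens: int = 2) -> int:
    """Assumed cost when the full sequence stops after `exit_layer` layers and
    only `kept_tokens` tokens ([s] and [a]) continue through the rest."""
    if not 0 <= exit_layer <= n_layers:
        raise ValueError("exit_layer must lie between 0 and n_layers")
    lower = n_tokens * exit_layer                   # full sequence, lower layers
    upper = kept_tokens * (n_layers - exit_layer)   # two-token continuation
    return lower + upper


def speedup(n_tokens: int, n_layers: int, exit_layer: int) -> float:
    """Ratio of full cost to early-exit cost under the assumptions above."""
    return full_cost(n_tokens, n_layers) / simlens_cost(n_tokens, n_layers, exit_layer)
```

Under these assumptions, a hypothetical 100-token prompt on a 32-layer model that exits at layer 24 costs 100 × 24 + 2 × 8 = 2416 units instead of 3200, so the two-token continuation adds 16 units on top of the truncated forward pass. These numbers are illustrative only and are not results from the paper.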

Paper Structure

This paper contains 29 sections, 11 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Overview of SimExit. At each monitored layer, Linear SimLens provides a low-cost entropy confidence score for exit decisions. Once the confidence threshold is met, full-sequence forwarding is truncated and SimLens continues with only two tokens (<s> and <a>) to decode the answer.
  • Figure 2: Layer-wise answer accuracy on ARC (left), BoolQ (middle), and HeadQA (right). We compare Logit Lens, Tuned Lens, SimLens, Linear SimLens, and SimLens-NoS (SimLens without <s>). Dashed lines show naive full-model accuracy. SimLens variants decode useful answers earlier than linear baselines, while removing <s> consistently reduces performance.
  • Figure 3: Layer-wise cross-entropy/perplexity on WikiText-103 (left), BoolQ (middle), and HeadQA (right). Linear SimLens and Tuned Lens mappings are trained once on ARC and directly evaluated on target datasets. Lower is better. SimLens is training-free and remains strongest across most layers.
  • Figure 4: SimExit performance on ARC (left), BoolQ (middle), and HeadQA (right). Colors indicate different entropy thresholds. Lower thresholds delay exits and preserve accuracy; higher thresholds exit earlier and increase speed with a larger accuracy trade-off.
  • Figure 5: Cross-task transfer for SimExit. Left/middle: evaluate on BoolQ and HeadQA using Linear SimLens trained on ARC. Right: evaluate on ARC using Linear SimLens trained on BoolQ. The method preserves a favorable speed-accuracy trade-off without per-task retraining.
  • ...and 11 more figures
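
The exit rule described in the captions of Figures 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each monitored layer yields a list of logits over the candidate answers (here hypothetical hand-written values standing in for Linear SimLens outputs), that confidence is the Shannon entropy of their softmax, and that the model exits at the first monitored layer whose entropy falls below the threshold. The names `first_exit_layer` and `demo_logits` are hypothetical.

```python
import math
from typing import Optional, Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_exit_layer(per_layer_logits: Sequence[Sequence[float]],
                     threshold: float) -> Optional[int]:
    """Return the index of the first monitored layer whose entropy is below
    `threshold`, or None if no layer qualifies (run the full model)."""
    for layer, logits in enumerate(per_layer_logits):
        if entropy(softmax(logits)) < threshold:
            return layer
    return None


# Hypothetical logits over four answer candidates at four monitored layers;
# the distribution sharpens with depth.
demo_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [6.0, 0.0, 0.0, 0.0],
]
```

With these illustrative values, `first_exit_layer(demo_logits, 1.0)` exits at monitored layer index 2, while the stricter threshold 0.1 delays the exit to index 3. This matches the qualitative behavior in Figure 4, where lower thresholds delay exits.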