HARP: Hesitation-Aware Reframing in Transformer Inference Pass

Romain Storaï, Seung-won Hwang

TL;DR

Standard Transformer inference spends the same compute on every token, regardless of how difficult the token is. HARP (Hesitation-Aware Reframed Forward Pass) is a training-free, model-agnostic method that triggers an additional, reframed forward pass when token-level uncertainty is high and then combines the results of the two passes. The reframed pass perturbs the embeddings with dropout to obtain an alternate representation, and it runs only when needed. HARP yields accuracy gains of up to 5.16% across diverse datasets and model sizes, while keeping inference time well below that of beam search. The result shows that uncertainty-guided reframing can improve performance without retraining, with practical implications for efficient deployment of large language models.

Abstract

This paper aims to improve the performance of large language models by addressing the variable computational demands of inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the "off-the-shelf" Transformer forward pass. Drawing on hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while running twice as fast as beam search. Simple yet yielding significant gains, HARP provides insights into the potential of adaptive computation for enhancing the performance of Transformer-based language models.
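The mechanism described above (detect hesitation, reframe the input, combine both passes) can be illustrated with a minimal pure-Python sketch. Several details here are assumptions made for illustration and are not stated in the text above: uncertainty is measured as the Shannon entropy of the next-token distribution, the threshold `theta`, dropout rate `drop_rate`, and mixing weight `beta` are placeholder values, the two passes are combined by a weighted average of logits, and `toy_forward` is a hypothetical stand-in for a real model.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dropout(embeddings: List[Vector], rate: float, rng: random.Random) -> List[Vector]:
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [[x / keep if rng.random() < keep else 0.0 for x in row] for row in embeddings]


def harp_step(
    forward: Callable[[List[Vector]], Vector],
    embeddings: List[Vector],
    theta: float = 1.0,      # assumed hesitation threshold
    drop_rate: float = 0.2,  # assumed dropout rate for reframing
    beta: float = 0.5,       # assumed weight of the original pass
    rng: Optional[random.Random] = None,
) -> Tuple[Vector, bool]:
    """Return next-token logits and whether the extra pass was triggered."""
    rng = rng or random.Random(0)
    logits = forward(embeddings)
    # Confident token: keep the vanilla forward pass unchanged.
    if entropy(softmax(logits)) <= theta:
        return logits, False
    # Hesitation: run a second pass on dropout-perturbed embeddings.
    reframed = forward(dropout(embeddings, drop_rate, rng))
    # Assumed combination rule: weighted average of the two logit vectors.
    combined = [beta * a + (1.0 - beta) * b for a, b in zip(logits, reframed)]
    return combined, True


def toy_forward(embeddings: List[Vector]) -> Vector:
    """Hypothetical stand-in for a model: column sums act as logits."""
    return [sum(col) for col in zip(*embeddings)]


# A peaked distribution skips the extra pass; a flat one triggers it.
confident_logits, confident_reframed = harp_step(toy_forward, [[5.0, 0.0, 0.0]])
uncertain_logits, uncertain_reframed = harp_step(toy_forward, [[1.0, 1.0, 1.0]])
```

Because the second pass runs only when the entropy exceeds `theta`, the extra cost scales with the number of uncertain tokens rather than with sequence length, which is consistent with the claim that inference stays faster than beam search.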

Paper Structure

This paper contains 31 sections, 4 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: The left side represents the Transformer's vanilla forward pass, while the right side illustrates the modified forward pass, HARP, which selectively applies additional computation by reframing inputs when the model hesitates. This improves performance on "harder" tokens without the need for retraining.
  • Figure 2: Average relative inference time for LLaMA 3.1 Instruct (8B) with greedy search on the original model (Vanilla), beam search decoding (Beam S.), and HARP with greedy search decoding (Ours). Values for this and other models are detailed in Table \ref{tab:inf_speed}.
  • Figure 3: Answer to the given prompt generated using HARP. Orange tokens highlight additional forward steps (i.e., tokens where uncertainty is higher than $\theta$). Blue tokens represent the model's top-1 predictions prior to reframing.
  • Figure 4: Multiple-choice Question prompt (CommonsenseQA and MMLU Pro).
  • Figure 5: GSM8K prompt.
  • ...and 2 more figures