HARP: Hesitation-Aware Reframing in Transformer Inference Pass
Romain Storaï, Seung-won Hwang
TL;DR
Transformer inference typically spends the same compute on every token, regardless of how difficult that token is to predict. HARP (Hesitation-Aware Reframed Forward Pass) is a training-free, model-agnostic method that triggers an additional, reframed forward pass when token-level uncertainty is high and then combines the results of the two passes. The reframed pass perturbs the embeddings with dropout to obtain an alternate representation, and it is applied only when needed. HARP yields accuracy gains of up to 5.16% across diverse datasets and model sizes while keeping inference time well below that of beam search. This advances adaptive computation in Transformers by showing that uncertainty-guided reframing can improve performance without retraining, with practical implications for efficient deployment of large language models.
Abstract
This paper aims to improve the performance of large language models by addressing the variable computational demands of inference steps, where some tokens require more computation than others. We present HARP, a simple modification to the "off-the-shelf" Transformer forward pass. Drawing on hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing the inputs to obtain a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while running twice as fast as beam search. Simple yet effective, HARP provides insights into the potential of adaptive computation for enhancing the performance of Transformer-based language models.
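To make the control flow concrete, the following is a minimal sketch of one hesitation-aware decoding step, written against a toy stand-in for the model rather than a real Transformer. Several details are assumptions made only for illustration and are not stated in this section: uncertainty is measured as the Shannon entropy of the next-token distribution, the trigger is a fixed `threshold`, the reframed input is produced by inverted dropout with probability `rate`, and the two passes are combined as a weighted average of logits with weight `beta`. The names `harp_step` and `toy_forward` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dropout(embeddings, rate, rng):
    """Inverted dropout: zero each component with probability `rate`
    and rescale the survivors by 1 / (1 - rate). Requires rate < 1."""
    scale = 1.0 / (1.0 - rate)
    return [[0.0 if rng.random() < rate else x * scale for x in vec]
            for vec in embeddings]


def toy_forward(embeddings):
    """Hypothetical stand-in for a model: mean-pools the embeddings and
    treats the pooled vector as next-token logits."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / n for d in range(dim)]


def harp_step(forward, embeddings, threshold=1.0, rate=0.2, beta=0.5, rng=None):
    """One decoding step: run the standard pass, and only if the model
    'hesitates' (entropy above `threshold`) run a second pass on
    dropout-perturbed embeddings and blend the two sets of logits.
    All default values here are illustrative assumptions."""
    if rng is None:
        rng = random.Random(0)
    logits = forward(embeddings)
    if entropy(softmax(logits)) <= threshold:
        return logits, False  # confident: no extra compute
    reframed = forward(dropout(embeddings, rate, rng))
    combined = [beta * a + (1.0 - beta) * b for a, b in zip(logits, reframed)]
    return combined, True


confident_logits, confident_triggered = harp_step(toy_forward, [[4.0, 0.0, 0.0]])
uncertain_logits, uncertain_triggered = harp_step(toy_forward, [[0.1, 0.1, 0.1]])
```

In this sketch the peaked input skips the second pass, while the uniform input (entropy ln 3, just above the assumed threshold of 1.0) triggers it. The extra forward pass is paid only on uncertain steps, which is the property the abstract contrasts with beam search.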
