Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs
Letian Cheng, Junyan Wang, Yan Gao, Elliott Wen, Ting Dang, Hong Jia
TL;DR
This paper addresses the unreliability of perplexity ($PPL$) as a metric for LLMs, especially for long inputs, and argues that input length should be treated as a first-class system variable. It introduces LengthBenchmark, a framework that jointly considers input length, evaluation protocol (sliding vs. non-sliding), and system-level costs, and that also covers quantized variants. Key findings show that sliding-window evaluation inflates measured performance on short inputs, while both full-precision and quantized models appear to gain from longer evaluated segments, with varying efficiency trade-offs. The work advances length-aware benchmarking to make cross-model comparison fairer and to better inform deployment decisions.
Abstract
Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when long, irrelevant inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective, and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs. We evaluate representative LLMs under two scoring protocols (direct accumulation and fixed-window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants, not as a main contribution but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding-window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realize gains as the evaluated segment length grows.
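To make the two scoring protocols concrete, the following is a minimal sketch and not the paper's implementation. It computes perplexity as the exponential of the negative mean token log-likelihood under (a) direct accumulation over disjoint segments and (b) a fixed-window sliding scheme. The abstract does not specify protocol details, so the segment handling, the names `ppl_non_sliding`, `ppl_sliding`, `seg_len`, `window`, and `stride`, and the `logprob_fn` interface are all assumptions made for illustration. The `toy_logprob` model is likewise hypothetical: it is constructed so that more context always helps, which only illustrates the mechanism by which a protocol can change the score and is not evidence for the paper's findings.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: returns log p(token | context) in nats.
LogProbFn = Callable[[Sequence[int], int], float]


def ppl_non_sliding(tokens: Sequence[int], logprob_fn: LogProbFn, seg_len: int) -> float:
    """Direct accumulation over disjoint segments; context resets at each boundary."""
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    total = 0.0
    for start in range(0, len(tokens), seg_len):
        seg = tokens[start:start + seg_len]
        for i, tok in enumerate(seg):
            # Each token sees only the earlier tokens of its own segment.
            total += logprob_fn(seg[:i], tok)
    return math.exp(-total / len(tokens))


def ppl_sliding(tokens: Sequence[int], logprob_fn: LogProbFn, window: int, stride: int) -> float:
    """Fixed-window sliding; each token is scored once, in the first window that covers it."""
    if not 0 < stride <= window:
        raise ValueError("require 0 < stride <= window")
    total = 0.0
    scored = 0  # number of tokens already scored
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        for pos in range(scored, end):
            # Context is the part of the current window that precedes `pos`.
            total += logprob_fn(tokens[begin:pos], tokens[pos])
        scored = end
        if end == len(tokens):
            break
    return math.exp(-total / len(tokens))


def toy_logprob(context: Sequence[int], token: int) -> float:
    """Hypothetical toy model: p(correct token) = (n + 1) / (n + 2) for n context tokens."""
    n = len(context)
    return math.log((n + 1) / (n + 2))


tokens = list(range(16))
print("non-sliding:", ppl_non_sliding(tokens, toy_logprob, seg_len=4))
print("sliding:    ", ppl_sliding(tokens, toy_logprob, window=4, stride=1))
```

In this sketch, setting `stride` equal to `window` makes the sliding protocol coincide with the non-sliding one, while a smaller `stride` gives most tokens a longer context. Under the toy model this lowers the reported perplexity on the same text, so the two protocols are not directly comparable at a fixed input length.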
