What is Wrong with Perplexity for Long-context Language Modeling?

Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang

TL;DR

The paper investigates why perplexity (PPL) fails to reflect the long-context abilities of large language models and shows that averaging over all tokens masks the influence of a small set of key tokens. It introduces LongPPL, a perplexity computed on these key tokens, which are identified by contrasting long and short contexts using the long-short difference (LSD) and the long-context likelihood (LCL), and LongCE, a re-weighted loss that emphasizes key-token prediction during fine-tuning. Across LongBench, LongEval, and RULER, LongPPL correlates strongly with long-context performance while traditional PPL shows little correlation, and LongCE delivers consistent improvements (up to 22% on LongEval) with modest training overhead. These contributions provide a principled framework for evaluating and improving long-context capabilities in LLMs, with practical implications for benchmarking and fine-tuning.

Abstract

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at https://github.com/PKU-ML/LongPPL.
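The re-weighting idea behind LongCE can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available under the full (long) context and under a truncated (short) context. The weight function used here, min(exp(LSD), gamma), the cap gamma=5.0, and the name reweighted_cross_entropy are illustrative assumptions, not necessarily the exact form used in the paper or its released code.

```python
import math

def reweighted_cross_entropy(logp_long, logp_short, gamma=5.0):
    """Toy key-token re-weighted loss (illustrative assumption, not the paper's code).

    logp_long[i]  : log P(token i | full long context)
    logp_short[i] : log P(token i | truncated short context)
    gamma         : hypothetical cap on the per-token weight
    """
    # Long-short difference: how much the long context helps each token.
    lsd = [l - s for l, s in zip(logp_long, logp_short)]
    # Tokens that benefit from the long context get a larger, capped weight.
    # In real training these weights would be treated as constants (no gradient).
    weights = [min(math.exp(d), gamma) for d in lsd]
    # Weighted negative log-likelihood, averaged over tokens.
    loss = sum(w * -l for w, l in zip(weights, logp_long)) / len(logp_long)
    return loss, weights

# Toy data: the third token is hard without the long context (a "key token").
logp_long = [-0.5, -0.6, -2.0]
logp_short = [-0.5, -0.7, -6.0]

loss, weights = reweighted_cross_entropy(logp_long, logp_short)
plain_ce = -sum(logp_long) / len(logp_long)

print("weights:", weights)
print("plain CE:", plain_ce, "re-weighted CE:", loss)
```

In this toy case the first two tokens keep a weight close to 1, while the third token reaches the cap, so its prediction error dominates the loss.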

Paper Structure

This paper contains 26 sections, 10 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: (a) A constructed example illustrating how LongPPL is calculated. We truncate the long context and, for each token, calculate the difference in generation probability between the long and short contexts (the long-short difference, LSD, defined in the paper's LSD equation). A high LSD score indicates that the token's generation is significantly enhanced by the long context, making it a key token in the long text. LongPPL is then obtained by calculating perplexity on these key tokens. (b) Long-context performance (LongBench; Bai et al., 2023) vs. perplexity measures (PPL and our LongPPL) computed on GovReport (Huang et al., 2021), a natural corpus. While PPL shows no correlation with the LongBench score, LongPPL achieves a Pearson correlation coefficient of $-0.96$.
  • Figure 2: (a) An example of the answer tokens in the LongEval task. (b, c) The correlation between accuracy and perplexity on answer tokens and on non-answer tokens on LongEval. Each point represents the result obtained from testing at a specific prompt length ranging from 2k to 28k. The experiments are conducted using Yi-6B-200K (Young et al., 2024) and CLEX-7B-64K (Chen et al.).
  • Figure 3: (a) Token distribution categorized by long-short difference (LSD). (b) Distribution of tokens with LSD greater than 0.5 categorized by long-context likelihood (LCL). The tokens are from the standard response of LongEval illustrated in Figure 2.
  • Figure 4: (a) Distribution of tokens in GovReport categorized by long-short difference. (b) The classification accuracy of discriminating answer tokens from non-answer tokens on LongEval with a classifier using different metrics (Random refers to a 50-50 random guess between the two classes).
  • Figure 5: Correlation between the PPL-based metrics (LongPPL and PPL) on GovReport (Huang et al., 2021) and long-context benchmarks. LongPPL is calculated using Qwen2-72B-Instruct. Results for LongBench are in Figure 1.
  • ...and 6 more figures
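
The procedure described in the Figure 1 caption (truncate the context, compute the long-short difference per token, keep high-LSD tokens, then take perplexity on them) can be sketched as follows. This is a minimal illustration under stated assumptions: log-probabilities are supplied as plain lists rather than computed by a model, the thresholds alpha and beta and the function names are hypothetical, and the toy numbers are constructed purely for demonstration.

```python
import math

def select_key_tokens(logp_long, logp_short, alpha=2.0, beta=-2.0):
    """Return indices of key tokens (thresholds are illustrative assumptions).

    A token is kept when the long context helps it a lot (LSD > alpha)
    and it is reasonably likely under the long context (LCL > beta).
    """
    keys = []
    for i, (l, s) in enumerate(zip(logp_long, logp_short)):
        lsd = l - s          # long-short difference
        lcl = l              # long-context likelihood (log scale)
        if lsd > alpha and lcl > beta:
            keys.append(i)
    return keys

def perplexity(logps):
    """Standard perplexity: exp of the mean negative log-probability."""
    if not logps:
        raise ValueError("need at least one token")
    return math.exp(-sum(logps) / len(logps))

def long_ppl(model_logp_long, key_idx):
    """Perplexity restricted to the key tokens."""
    return perplexity([model_logp_long[i] for i in key_idx])

# Toy setup: 100 tokens, of which positions 40 and 90 depend on the long context.
n, key_positions = 100, {40, 90}

# Log-probs from a (hypothetical) evaluator model used only to find key tokens.
eval_long = [-0.4 if i in key_positions else -1.0 for i in range(n)]
eval_short = [-6.0 if i in key_positions else -1.1 for i in range(n)]
key_idx = select_key_tokens(eval_long, eval_short)

# Two (hypothetical) models under evaluation, scored with the long context.
model_a = [-0.5 if i in key_positions else -1.0 for i in range(n)]   # good on key tokens
model_b = [-3.0 if i in key_positions else -0.95 for i in range(n)]  # poor on key tokens

ppl_a, ppl_b = perplexity(model_a), perplexity(model_b)
long_ppl_a, long_ppl_b = long_ppl(model_a, key_idx), long_ppl(model_b, key_idx)

print("key tokens:", key_idx)
print("PPL     A vs B:", ppl_a, ppl_b)
print("LongPPL A vs B:", long_ppl_a, long_ppl_b)
```

In this constructed example the two models have nearly the same PPL because the 98 ordinary tokens dominate the average, while their perplexity on the two key tokens differs by more than an order of magnitude.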