
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen

TL;DR

HELMET addresses the fragmentation and unreliability of existing long-context language model (LCLM) benchmarks with a diverse, application-centric evaluation suite that offers controllable input lengths up to 128K tokens, model-based metrics, and few-shot prompting so that base models can be evaluated robustly. A study of 59 LCLMs shows that synthetic tasks such as needle-in-a-haystack (NIAH) do not reliably predict downstream performance, that the seven task categories exhibit distinct trends and low correlations with each other, and that open-source models lag significantly behind closed ones on tasks requiring full-context reasoning or complex instruction following, with the gap widening as length increases. The authors recommend the RAG tasks as an easy-to-run, more predictive subset for fast model development, while advocating holistic evaluation across diverse tasks.

Abstract

Many benchmarks exist for evaluating long-context language models (LCLMs), yet developers often rely on synthetic tasks such as needle-in-a-haystack (NIAH) or an arbitrary subset of tasks. However, it remains unclear whether these benchmarks reflect the diverse downstream applications of LCLMs, and such inconsistencies further complicate model comparison. We investigate the underlying reasons behind these practices and find that existing benchmarks often provide noisy signals due to limited coverage of applications, insufficient context lengths, unreliable metrics, and incompatibility with base models. In this work, we introduce HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address several issues in previous benchmarks by adding controllable lengths up to 128K tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 59 LCLMs, we find that (1) synthetic tasks like NIAH do not reliably predict downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlations with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning or following complex instructions -- the gap widens as length increases. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and better predict other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.

Paper Structure (53 sections, 14 figures, 19 tables)

This paper contains 53 sections, 14 figures, 19 tables.

Figures (14)

  • Figure 1: Long-context benchmark results of frontier LCLMs (Llama-3.1 8B/70B, GPT-4o-mini, GPT-4o-08-06, and Gemini-1.5 Flash/Pro) at 128K input length. NIAH is saturated for almost all models; RULER (Hsieh et al., 2024) and $\infty$Bench (Zhang et al., 2024) show unexpected trends for Llama-3.1 (Dubey et al., 2024). In contrast, HELMET demonstrates more consistent rankings of these frontier models.
  • Figure 2: Comparison between ROUGE-L F1 and our model-based evaluation metric on summarization tasks. Our metric shows more consistent trends: it reflects the performance gain on GPT-4o with increased input length, while ROUGE remains almost the same; our metric also clearly differentiates models while ROUGE shows little distinction.
  • Figure 3: Spearman's rank correlation at 128K input length, calculated across 35 instruction-tuned models.
  • Figure 4: Distribution of instruction-tuned models' performance on $\infty$Bench QA with respect to NIAH, RULER MK, and HotpotQA.
  • Figure 5: Spearman's rank correlation between different categories at $L = 128$K.
  • ...and 9 more figures
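
Figures 3 and 5 report Spearman's rank correlation between benchmark scores across models. As a rough illustration of that statistic, and not the authors' analysis code, the sketch below computes it with the Python standard library: each score list is converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. The function names and all scores are hypothetical and chosen only for the example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks for `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2 models)")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # Every model has the same score on one task, so no ranking exists.
        raise ValueError("correlation is undefined for a constant score list")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five models on two task categories (not from the paper).
rag_scores = [70.0, 62.0, 55.0, 40.0, 48.0]
qa_scores = [50.0, 45.0, 38.0, 20.0, 30.0]
rho = spearman(rag_scores, qa_scores)
print(f"Spearman rho between the two hypothetical categories: {rho:.2f}")
```

A fully saturated task, where every model scores the same, yields a constant score list and therefore no defined rank correlation, which is why the sketch raises an error in that case instead of returning a number.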