AbsenceBench: Language Models Can't Tell What's Missing

Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, Ari Holtzman

TL;DR

AbsenceBench introduces a new benchmark to measure whether large language models can detect deliberately omitted information in long-context inputs, across poetry, numerical sequences, and GitHub pull requests. Unlike presence-focused benchmarks such as NIAH, AbsenceBench reveals a substantial gap in current models' ability to identify omissions, with performance improving dramatically only when explicit placeholders are used. The study analyzes prompting, context length, omission rate, and inference-time computation, finding that attention mechanisms struggle with gaps and that improvements come at high token-generation costs. The results motivate future work on absence-aware architectures and evaluation frameworks, highlighting practical implications for LLMs as judges and assistants in real-world tasks that require recognizing what is missing.

Abstract

Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
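
The task setup described in the abstract can be illustrated with a minimal sketch. It assumes that documents are split into lines, that each line is dropped independently with a fixed probability, and that predictions are scored by exact line match. The helper names `make_instance` and `f1_score` are hypothetical, and the paper's actual construction and matching procedure may differ.

```python
import random


def make_instance(document: str, omit_prob: float, seed: int = 0):
    """Drop each line of `document` with probability `omit_prob`.

    Returns the edited document and the list of omitted lines.
    (Assumption: line-level omission; not taken from the paper's code.)
    """
    rng = random.Random(seed)
    kept, omitted = [], []
    for line in document.splitlines():
        if rng.random() < omit_prob:
            omitted.append(line)
        else:
            kept.append(line)
    return "\n".join(kept), omitted


def f1_score(predicted, omitted):
    """F1 between predicted and truly omitted lines, using exact matching.

    Lines are compared as sets, so repeated identical lines count once.
    """
    pred, gold = set(predicted), set(omitted)
    if not pred or not gold:
        # Both empty counts as a perfect answer; otherwise no overlap is possible.
        return 1.0 if pred == gold else 0.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    original = "\n".join(str(n) for n in range(1, 21))  # a toy numerical sequence
    edited, omitted = make_instance(original, omit_prob=0.2, seed=42)
    # A model would receive both `original` and `edited` and be asked to
    # list the missing lines; here we score a perfect answer.
    print(f1_score(omitted, omitted))
```

A model's response would be parsed into a list of lines and passed as `predicted`; how responses are parsed is left out of this sketch.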

Paper Structure

This paper contains 38 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: (a) An overview of the difference between the Needle-in-a-haystack (NIAH) test setting and the AbsenceBench task setting. AbsenceBench asks models to identify omitted pieces of content. (b) Performance of 5 SoTA LLMs on AbsenceBench is significantly lower than on the NIAH test, measured by F1-score. (c) An illustration of our task setting using the "haystack" metaphor, generated by ChatGPT.
  • Figure 2: The three domains in AbsenceBench test models' abilities across a variety of document lengths and omission probabilities. Frequency reports the number of tasks in the domain within a given range of document lengths. The average context length across all tasks in AbsenceBench is 5K tokens. At the document level, the average length is 2.7K tokens overall: 4.7K for poetry, 1.5K for numerical sequences, and 1.7K for GitHub pull requests. We use the GPT-4 tokenizer to measure document and context lengths.
  • Figure 3: Reasoning models often generate an order of magnitude more text than the input document. Distribution of the thinking token ratio (number of generated thinking tokens divided by number of tokens in the original document) for four inference-time compute models under each domain. We set the parameters of the boxplot to capture 99% of the distribution. The outliers are hidden for better clarity (see Figure \ref{fig:thining_boxplot_full} for the full distribution).
  • Figure 4: Closed-source models (reds) perform better than open-weights models (blues) on AbsenceBench, while generating more thinking tokens. Each plot shows the average F1-score (x-axis) and the average thinking token ratio (y-axis). The grey line presents a visual boundary.
  • Figure 5: GPT-4.1-mini performs worse on longer tasks in Poetry, but the relationship is not clear in Numerical Sequences and GitHub PRs. Each plot shows the F1-score (y-axis) and the total context length (x-axis). Dark blue represents a lower and dark red represents a higher percentage of omission (number of omitted lines divided by total number of lines across each of three domains). Each dot on the graph represents the performance on a single instance. The dashed lines represent the least squares fit, with $R^2$ indicating the strength of correlation between the two axes.
  • ...and 3 more figures
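
The quantities named in the captions of Figures 3 and 5 can be written out as a short sketch. The formulas follow the caption definitions (thinking tokens divided by document tokens; omitted lines divided by total lines; $R^2$ of a least squares line). The function names and the example numbers are hypothetical and are not results from the paper.

```python
def thinking_token_ratio(num_thinking_tokens: int, num_document_tokens: int) -> float:
    """Generated thinking tokens divided by tokens in the original document."""
    if num_document_tokens <= 0:
        raise ValueError("document must contain at least one token")
    return num_thinking_tokens / num_document_tokens


def omission_percentage(num_omitted_lines: int, num_total_lines: int) -> float:
    """Omitted lines divided by total lines, as a fraction in [0, 1]."""
    if num_total_lines <= 0:
        raise ValueError("document must contain at least one line")
    return num_omitted_lines / num_total_lines


def r_squared(xs, ys) -> float:
    """R^2 of a simple least squares line fit of ys on xs.

    For a one-variable linear fit this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant axis carries no linear relationship
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(thinking_token_ratio(30_000, 3_000))  # an "order of magnitude" case
    print(omission_percentage(10, 100))
    print(r_squared([1_000, 2_000, 3_000], [0.9, 0.7, 0.5]))
```

Token counts would come from a tokenizer applied to the document and to the model's reasoning output; that step is omitted here.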