Logic Haystacks: Probing LLMs Long-Context Logical Reasoning (Without Easily Identifiable Unrelated Padding)

Damien Sileo

TL;DR

This work investigates true long-context logical reasoning in large language models by constructing Logic Haystacks, a dataset of long premises built from First-Order Logic and simplified English, spanning up to 2048 clauses (≈25,000 tokens). It introduces a scalable generation pipeline using satisfiable merging and counterfactual evidence analysis to isolate necessary and sufficient evidence for contradiction. The study demonstrates that realistic distractors drastically shrink the effective context window, with 2-evidence retrieval remaining very challenging even for large models, underscoring limits of current architectures. By providing an unbiased, padding-free benchmark for long-context reasoning, the work motivates future research on genuine long-context understanding and targeted training in logical reasoning.

Abstract

Large language models demonstrate promising long-context processing capabilities, with recent models touting context windows close to one million tokens. However, the evaluations supporting these claims often involve simple retrieval tasks or synthetic tasks padded with irrelevant text, which the models may easily detect and discard. In this work, we generate lengthy simplified English text with first-order logic representations spanning up to 2048 clauses (around 25k GPT-4 tokens). We formulate an evaluation task with evidence retrieval for contradiction detection. The long, homogeneous text is filled with distractors that are both hard to distinguish from relevant evidence and provably do not interfere with it. Our evaluation of evidence retrieval shows that the effective context window is much smaller with realistic distractors, already crumbling at 128 clauses.
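
To make the idea of evidence retrieval for contradiction detection concrete, here is a minimal sketch of a counterfactual check: a premise counts as necessary evidence if removing it makes the hypothesis consistent with the remaining premises. This is an illustration under strong assumptions, not the paper's pipeline. It uses propositional clauses and a brute-force satisfiability check in place of first-order logic and a theorem prover, and the names `satisfiable`, `necessary_evidence`, `is_sufficient`, `PREMISES`, and `HYPOTHESIS` are hypothetical.

```python
from itertools import product

# Assumption: a clause is a list of integer literals (n means variable n is
# true, -n means it is false), and a clause holds if any literal holds.
# This propositional encoding is a stand-in for first-order logic.


def satisfiable(clauses):
    """Brute-force check: does some truth assignment satisfy every clause?"""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


def necessary_evidence(premises, hypothesis):
    """Indices of premises whose removal alone restores satisfiability."""
    if satisfiable(premises + [hypothesis]):
        return []  # no contradiction, so there is nothing to retrieve
    needed = []
    for i in range(len(premises)):
        remaining = premises[:i] + premises[i + 1:] + [hypothesis]
        if satisfiable(remaining):
            needed.append(i)
    return needed


def is_sufficient(premises, hypothesis, indices):
    """True if the selected premises alone already contradict the hypothesis."""
    return not satisfiable([premises[i] for i in indices] + [hypothesis])


# Toy haystack with variables 1=a, 2=b, 3=c, 4=d.
PREMISES = [
    [1],      # a
    [-1, 2],  # a implies b
    [3, 4],   # c or d        (distractor)
    [-3, 4],  # c implies d   (distractor)
]
HYPOTHESIS = [-2]  # not b

evidence = necessary_evidence(PREMISES, HYPOTHESIS)
print(evidence, is_sufficient(PREMISES, HYPOTHESIS, evidence))
```

In this toy case the two premises about `a` and `b` are each necessary and together sufficient, which mirrors the two-evidence setting, while the clauses about `c` and `d` play the role of non-interfering distractors. Brute force is exponential in the number of variables, so a sketch like this only works at toy sizes.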

Paper Structure

This paper contains 17 sections and 4 figures.

Figures (4)

  • Figure 1: One evidence
  • Figure 2: Two evidences
  • Figure 3: One evidence, not isolating the hypothesis
  • Figure 4: One evidence, impact of model size