GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?

Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen

TL;DR

GSM-$\infty$ addresses the problem of benchmarking LLM reasoning under unboundedly increasing context length by introducing a synthetic, graph-based benchmark that scales both context length and reasoning complexity. It builds problem statements from computational graphs with explicit and implicit operations, and injects noise via a spider-like topology to make information retrieval harder during solving. The authors present a reverse-mode data generation procedure that produces implicit subtraction and division, three real-world templates that maintain linguistic variety, and an evaluation across dozens of models showing sigmoid-like degradation as complexity grows and only linear AUC gains from exponentially more inference compute. The benchmark provides a scalable testbed for systematically studying LLM reasoning in dense, long contexts and highlights fundamental scaling limits, guiding future research on training and inference strategies for advanced reasoning tasks.
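The graph-to-problem idea described above can be illustrated with a toy sketch. This is not the authors' generator: the function names `build_problem` and `solve`, the sentence patterns, and the chain-shaped core graph are all assumptions made for illustration, and the sketch covers only explicit addition and multiplication (no reverse-mode implicit operations). Noise nodes hang off core nodes and are never needed for the query, loosely mimicking the spider-like noise topology.

```python
import random


def build_problem(num_ops, num_noise, seed=0):
    """Return (statements, query, answer) for a toy graph-based problem.

    Hypothetical sketch: a chain of num_ops operations forms the core graph,
    and num_noise extra nodes are attached as distractors.
    """
    rng = random.Random(seed)
    values = {"v0": rng.randint(1, 9)}
    statements = [f"v0 is {values['v0']}."]

    # Core chain: each node depends on the previous one, so answering the
    # query requires exactly num_ops arithmetic operations.
    for i in range(1, num_ops + 1):
        prev, name = f"v{i - 1}", f"v{i}"
        k = rng.randint(1, 5)
        if rng.random() < 0.5:
            values[name] = values[prev] + k
            statements.append(f"{name} is {k} more than {prev}.")
        else:
            values[name] = values[prev] * k
            statements.append(f"{name} is {k} times {prev}.")

    # Noise nodes: attached to core nodes but never used by the query.
    core = list(values)
    for j in range(num_noise):
        parent, k, name = rng.choice(core), rng.randint(1, 5), f"n{j}"
        values[name] = values[parent] + k
        statements.append(f"{name} is {k} more than {parent}.")

    rng.shuffle(statements)  # hide the solution order in the context
    query = f"v{num_ops}"
    return statements, query, values[query]


def solve(statements, query):
    """Reference solver: parse the statements and resolve the query node."""
    rules = {}
    for s in statements:
        words = s.rstrip(".").split()
        if len(words) == 3:            # "<name> is <int>"
            rules[words[0]] = ("const", int(words[2]), None)
        elif words[3] == "more":       # "<name> is <k> more than <parent>"
            rules[words[0]] = ("add", int(words[2]), words[5])
        else:                          # "<name> is <k> times <parent>"
            rules[words[0]] = ("mul", int(words[2]), words[4])

    def value(name):
        op, k, parent = rules[name]
        if op == "const":
            return k
        base = value(parent)
        return base + k if op == "add" else base * k

    return value(query)


if __name__ == "__main__":
    stmts, query, answer = build_problem(num_ops=4, num_noise=3, seed=1)
    print(" ".join(stmts), f"What is {query}?")
    assert solve(stmts, query) == answer
```

In this sketch, `num_ops` plays the role of reasoning complexity and `num_noise` the role of context length: raising the second lengthens the prompt without changing the answer or the number of operations required.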

Abstract

Long-context large language models (LLMs) have recently shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs, and the ability to introduce noise by adding unnecessary nodes and edges, we develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-Infinite benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
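One way to read the two empirical trends stated above is sketched below. The parameterization is an assumption made for illustration, not a formula taken from the paper: $n$ is the number of required operations, $k > 0$ and $n_0$ are hypothetical fit parameters for the steepness and midpoint of the decline, $C$ is inference compute, and $a$, $b$ are hypothetical constants.

```latex
% Sigmoid decline of accuracy with reasoning complexity n (assumed form):
\mathrm{acc}(n) \approx \frac{1}{1 + e^{\,k\,(n - n_0)}}, \qquad k > 0

% Inference scaling: if compute grows geometrically, C_t = C_0\, r^{t},
% and the score S grows only linearly in t, then S is linear in \log C:
S(C) \approx a + b \log C
```

Under this reading, each fixed additive gain in score costs a constant multiplicative factor in compute.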

Paper Structure

This paper contains 37 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Evaluation of 10 powerful LLMs on GSM-$\infty$, comparing API generation cost (horizontal axis) with zero-context reasoning ability (vertical axis). Bubble size represents reasoning performance at a 16K context length.
  • Figure 2: (a) We position existing benchmarks on a plot of reasoning complexity versus context length. Reasoning datasets usually have very short contexts, while existing long-context benchmarks are usually low in reasoning complexity. Our task can cover any context length the user chooses and can generate unbounded reasoning complexity; however, problems with high reasoning complexity require a longer context. Our task is shown in red. (b) A simplified example of our dataset-building process. We first generate an interconnected computational graph and then, based on the graph, attach real-world context to formulate the problem statements. (c) Score decay of Qwen2.5-72B-Instruct across zero-context, 8K, 16K, and 32K settings.
  • Figure 3: Study of Llama3.1-70B-Instruct with Passive RAG (referred to as OnePassRAG) and Active RAG (referred to as InteractiveRAG) on popular long-context benchmarks: RULER (at 64K context length), LongBench ($>$8K), LongBenchV2, and LOFT (128K context length). Each RAG system operates under a budget of 2048 retrieved tokens and uses Llama-3.1-70B-Instruct as its decoder. The RAG systems generally perform robustly, on par with the corresponding LLMs, showing that previous long-context benchmarks are either too simple in reasoning complexity or contain detectable noise.
  • Figure 4: (a) presents a conservative estimate of the difficulty of each problem in the GSM-8K 1.3K test set. We measure a problem's difficulty by the number of operations needed to reach the final answer. The op count ranges from 2 to 12, with most problems around 3-4. (b) shows Llama3.1-8B-Instruct performance across different semantic hierarchies, revealing the hidden reasoning difficulty innate in natural language.
  • Figure 5: RAG performance on our proposed long-context benchmarks. (a) studies the retriever's behavior on the first 100 chunks of a random problem in vt from RULER with 8192 context length. The chunks that must be retrieved to solve the problem are labeled in coral, while the noise is in blue. The chunks are ranked by retriever score from large (semantically far) to small (semantically close). The retriever locates the essential chunks with high precision, placing all necessary chunks on the right side of the spectrum; (b) contrasts vt with our long-context benchmarks, showing that the retriever cannot locate precisely which chunks to retrieve. (c) and (d) display the performance of two RAG systems on our benchmark's medium and hard tasks. (Figure best viewed in color)
  • ...and 9 more figures