Table of Contents
Fetching ...

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

TL;DR

The paper introduces an OEIS-based benchmark to evaluate large language models on generating Python code for integer sequences, focusing on mathematical and algorithmic reasoning under time constraints. It curates 1000 sequences divided into easy/hard and contemporary/classic, and employs a cheating-detection mechanism to ensure solutions rely on genuine algorithmic generation rather than memorized values. Empirical results show reasoning-focused models outperform non-reasoning counterparts, especially on hard sequences, while highlighting persistent challenges in complex algorithmic tasks. A case study demonstrates how reasoning models leverage memoization to improve efficiency, underscoring the qualitative differences in approach. The work lays a foundation for robust, updateable evaluation and suggests future extensions such as tool use and retrieval-augmented generation to advance mathematical reasoning in LLMs.

Abstract

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.

Benchmarking Large Language Models with Integer Sequence Generation Tasks

TL;DR

The paper introduces an OEIS-based benchmark to evaluate large language models on generating Python code for integer sequences, focusing on mathematical and algorithmic reasoning under time constraints. It curates 1000 sequences divided into easy/hard and contemporary/classic, and employs a cheating-detection mechanism to ensure solutions rely on genuine algorithmic generation rather than memorized values. Empirical results show reasoning-focused models outperform non-reasoning counterparts, especially on hard sequences, while highlighting persistent challenges in complex algorithmic tasks. A case study demonstrates how reasoning models leverage memoization to improve efficiency, underscoring the qualitative differences in approach. The work lays a foundation for robust, updateable evaluation and suggests future extensions such as tool use and retrieval-augmented generation to advance mathematical reasoning in LLMs.

Abstract

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Each panel visualizes an individual OEIS sequence using integer-valued $(n,a(n))$ pairs plotted as raw scatter plots without smoothing. Top-left (A265326): This sequence forms a striking pattern of diagonal parallelograms, caused by taking each prime $p$, reversing its binary expansion, and subtracting: $a(n) = p_n - \text{reverse}(p_n)$, where $p_n$ is the $n$-th prime. The symmetry arises because reversals often yield other primes, and transitions occur at binary boundaries (e.g., powers of 2), expanding with scale. Top-right (A133058): This chaotic-looking trajectory dramatically stabilizes after n=640, where it enters a perfectly repeating three-term loop. N. J. A. Sloane famously compared this to the scene in Avatar where Jake Sully finally tames his Banshee: “fly straight, dammit.” Bottom-left (A229037): A non-averaging, fractal-like sequence that forbids 3-term arithmetic progressions. Its dense layering and soft envelope illustrate global constraints emerging from a purely local rule. Bottom-right (A005185): Hofstadter's Q-sequence, a meta-Fibonacci recursion that lacks a known growth law or closed-form solution. Despite its recursive chaos, the values tightly track a diagonal, hinting at regularity buried in self-reference.
  • Figure 2: Workflow for curating the OEIS‐based benchmark dataset. Starting from the full OEIS collection, we first filter by a July 2024 timeline cutoff into "Classic" (pre-cutoff) and "Contemporary" (post-cutoff) sequences. Each branch is then split by the OEIS "easy"/"hard" tags into four subsets: Classic Easy, Classic Hard, Contemporary Easy, and Contemporary Hard, each containing 250 sequences. Finally, these are recombined into the 1,000-sequence benchmark set.
  • Figure 3: Distribution of scores for the top three reasoning and non-reasoning models. Shown are score distributions for the hard sequences (red for reasoning models, yellow for non-reasoning models) and easy sequences (blue for reasoning models, green for non-reasoning models). The percentage of sequences for which each model achieves a perfect score is shown on the right. All models show distributions skewed toward low scores on the hard sequences, while non-reasoning models have near-uniform scores on the easy set and reasoning models are strongly skewed toward high scores.
  • Figure 4: Classification of error modes for top reasoning and non-reasoning models. Shown are failure types for three top-performing reasoning and non-reasoning models on both the hard and easy sequence sets. Lookup-table use and memorization occur much more frequently on the hard sequences than on the easy ones.