Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley; Manish Bhattarai; Nishath Rajiv Ranasinghe; Erick Draayer; Javier Santos

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

TL;DR

The paper introduces an OEIS-based benchmark to evaluate large language models on generating Python code for integer sequences, focusing on mathematical and algorithmic reasoning under time constraints. It curates 1000 sequences divided into easy/hard and contemporary/classic, and employs a cheating-detection mechanism to ensure solutions rely on genuine algorithmic generation rather than memorized values. Empirical results show reasoning-focused models outperform non-reasoning counterparts, especially on hard sequences, while highlighting persistent challenges in complex algorithmic tasks. A case study demonstrates how reasoning models leverage memoization to improve efficiency, underscoring the qualitative differences in approach. The work lays a foundation for robust, updateable evaluation and suggests future extensions such as tool use and retrieval-augmented generation to advance mathematical reasoning in LLMs.

Abstract

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.

Benchmarking Large Language Models with Integer Sequence Generation Tasks

TL;DR

Abstract

Benchmarking Large Language Models with Integer Sequence Generation Tasks

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (4)