The CLRS-Text Algorithmic Reasoning Language Benchmark

Larisa Markeeva, Sean McLeish, Borja Ibarz, Wilfried Bounsi, Olga Kozlova, Alex Vitvitskyi, Charles Blundell, Tom Goldstein, Avi Schwarzschild, Petar Veličković

TL;DR

The paper introduces CLRS-Text, a textual benchmark that converts CLRS algorithmic traces into language-model prompts, enabling structured, multi-task evaluation of algorithmic reasoning across 30 tasks. It motivates robust out-of-distribution generalization and uses resampling to avoid static-test biases, providing a controlled framework for comparing LM reasoning across publications. Through multi-task fine-tuning of Gemma 2B (with and without randomized positional embeddings) and zero-shot/extrapolation evaluations, the study reveals that while positional randomness aids generalization, extrapolation remains challenging for LM-based reasoning, underscoring the need for future approaches such as chain-of-thought and tool integration. Overall, CLRS-Text offers a standardized, extensible platform for assessing and advancing LM algorithmic reasoning.

Abstract

Eliciting reasoning capabilities from language models (LMs) is a critical direction on the path towards building intelligent systems. Most recent studies dedicated to reasoning focus on out-of-distribution performance on procedurally-generated synthetic benchmarks, bespoke-built to evaluate specific skills only. This trend makes results hard to transfer across publications, slowing down progress. Three years ago, a similar issue was identified and rectified in the field of neural algorithmic reasoning, with the advent of the CLRS benchmark. CLRS is a dataset generator comprising graph execution traces of classical algorithms from the Introduction to Algorithms textbook. Inspired by this, we propose CLRS-Text, a textual version of these algorithmic traces. Out of the box, CLRS-Text is capable of procedurally generating trace data for thirty diverse, challenging algorithmic tasks across any desirable input distribution, while offering a standard pipeline in which any additional algorithmic tasks may be created in the benchmark. We fine-tune and evaluate various LMs as generalist executors on this benchmark, validating prior work and revealing a novel, interesting challenge for the LM reasoning community. Our code is available at https://github.com/google-deepmind/clrs/tree/master/clrs/_src/clrs_text.

Paper Structure (8 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Top: The algorithmic trace of insertion sorting the list $[5,2,4,3,1]$, in graph form (reprinted from Veličković et al., 2022). Bottom: The same algorithmic trace, represented textually using our provided CLRS-Text generator. The model receives as input (depicted in green) the input array (key) and the initial value of the sorting trace (initial_trace), from which it is prompted to predict the trace (depicted in blue) of gradually sorting the list by inserting one element at a time into a partially sorted list, from left to right. At the end, the model needs to output the final sorted array (depicted in red), and it is evaluated on whether this array is predicted correctly.
  • Figure 2: Resampling test results of variants of Gemma 2B, and Gemini 1.5 Flash, on various problem sizes. Gemma 2B variants were explicitly trained on CLRS-Text tasks (the training set sizes are denoted by red dots) and are evaluated zero-shot. Gemini 1.5 Flash is a pre-trained general-purpose model, evaluated in a two-shot manner. Due to space constraints, this plot only shows results on eight representative algorithms; the detailed plots for all thirty algorithms are available in the appendix.
  • Figure 3: Top: The algorithmic trace of optimising the order of multiplications in a chain of matrices, in graph form, for multiplying matrices of size $(10\times 30)(30\times 5)(5\times 60)$, assuming an $O(n^3)$-time multiplication algorithm (reprinted from Veličković et al., 2022). Bottom: The same algorithmic trace, represented textually using our provided CLRS-Text generator. The model receives the input matrix sizes (p) and the initial value of the pointers (initial_trace), from which it is prompted to predict the trace of gradually determining optimal orders of multiplying various subchains of the original chain of matrices. Note that, in our default data generator, we do not store intermediate numbers of operations; only the pointers are preserved in the trace.
  • Figure 4: Top: The algorithmic trace of finding single-source shortest paths (from node zero) using the Bellman-Ford algorithm, in graph form, for a given undirected weighted graph (reprinted from Veličković et al., 2022). Bottom: The same algorithmic trace, represented textually using our provided CLRS-Text generator. The model receives the source node identity (s), the weighted adjacency matrix (A) and the initial value of the predecessor pointers (initial_trace), from which it is prompted to predict the trace of gradually recomputing predecessor pointers until all single-source shortest paths are found. Note that, in our default data generator, we do not store intermediate path lengths; only the pointers are preserved in the trace.
  • Figure 5: Top: The algorithmic trace of finding the first occurrence of the string "ab" inside the string "aab", in graph form (reprinted from Veličković et al., 2022). Bottom: The same algorithmic trace, represented textually using our provided CLRS-Text generator. The model receives the identifier of which character belongs to which string (string), the value of each character (key) and the initial value of the position at which the match is queried (initial_trace). Using this information, the model is prompted to predict the trace of gradually adjusting the querying position until a full match is found.
  • ...and 1 more figures
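
To make the Figure 1 caption concrete, the following is a minimal Python sketch of how an insertion-sort trace for the list [5, 2, 4, 3, 1] could be laid out as text. It is not the CLRS-Text generator: the function names (insertion_sort_trace, format_example) and the exact prompt layout are assumptions made for illustration, and only the field names key and initial_trace come from the caption.

```python
def insertion_sort_trace(key):
    """Return (states, final): the array after each insertion step, and the sorted array.

    Illustrative only; the real CLRS-Text generator may record different
    intermediate states.
    """
    arr = list(key)
    states = []
    for i in range(1, len(arr)):
        j = i
        # Shift element i left until the prefix arr[:i + 1] is sorted.
        while j > 0 and arr[j - 1] > arr[j]:
            arr[j - 1], arr[j] = arr[j], arr[j - 1]
            j -= 1
        states.append(list(arr))
    return states, arr


def format_example(key):
    """Render a (prompt, target) pair in a hypothetical text layout.

    The prompt carries the inputs (key, initial_trace); the target carries
    the intermediate trace followed by the final sorted array.
    """
    states, final = insertion_sort_trace(key)
    prompt = f"insertion_sort:\nkey: {list(key)}\ninitial_trace: {list(key)}\ntrace | pred:"
    trace_text = ", ".join(str(s) for s in states)
    target = f"{trace_text} | {final}"
    return prompt, target


if __name__ == "__main__":
    prompt, target = format_example([5, 2, 4, 3, 1])
    print(prompt)
    print(target)
```

In this sketch, a model would be scored only on the part of the target after the separator, mirroring the caption's statement that evaluation checks whether the final sorted array is predicted correctly.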