Datasets for Studying Generalization from Easy to Hard Examples

Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Arpit Bansal, Zeyad Emam, Furong Huang, Micah Goldblum, Tom Goldstein

TL;DR

The paper introduces three benchmarks (Prefix Sums, Mazes, and Chess Puzzles) for studying how models trained on easy examples of a reasoning task generalize to harder ones. It describes a data-generation pipeline for each benchmark in which difficulty can be raised step by step: longer binary strings for Prefix Sums, larger spanning-tree mazes for Mazes, and Lichess puzzles encoded as board-state tensors for Chess Puzzles. It also specifies the input–label format of each dataset. A companion Python package provides access to the data along with generation and visualization tools, lowering the barrier to adopting these benchmarks. Together, these resources make it practical to probe extrapolation to harder problems rather than only standard IID generalization.

Abstract

We describe new datasets for studying generalization from easy to hard examples.

Paper Structure

This paper contains 11 sections and 4 figures.

Figures (4)

  • Figure 1: Prefix Sums input samples and their corresponding targets/labels. We provide multiple sets, each containing problems of a different length, and intend for users to train on shorter strings and test on longer ones. Examples from the sets of length 16 and 28 are shown above.
  • Figure 2: The Mazes generation process for making a $5 \times 5$ maze. We start with a $3 \times 3$ grid graph. Each side of the grid graph contains 3 nodes and 2 edges ($3+2=5$ total elements). We then produce a spanning tree for the graph using a randomized algorithm. The tree encodes the allowed and forbidden paths. This tree is then represented as a $5 \times 5$ array of maze cells. Finally, we convert this to an image by representing each cell as a $2 \times 2$ array of pixels, and adding a 3-pixel border on each side. This creates an image representation that has $5\times 2 + 3 + 3 = 16$ pixels on each side. The green and red start and end cells are chosen at random.
  • Figure 3: Samples from Mazes. Mazes (top) and their labels/solutions (bottom) of size $9 \times 9$, $13 \times 13$, and $21 \times 21$.
  • Figure 4: Samples from Chess Puzzles. "Easy" chess puzzles, each with their solution below. The board state is represented as a $12\times 8\times 8$ tensor, where the first 6 maps encode the pieces belonging to the player to act next. The white player is acting in the two leftmost puzzles, while the black player acts in the puzzle on the right. The solution is represented as a $1\times 8\times 8$ tensor that marks the start and end position of the piece to be moved.
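
The formats described in the captions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code or the companion package's API: all function names are hypothetical, the Prefix Sums target is assumed to be the running sum modulo 2 (the captions say only "binary prefix sums"), and the (row, column) indexing of the chess label is likewise an assumption. The maze arithmetic follows the Figure 2 caption directly.

```python
def prefix_sums_mod2(bits):
    """Running sum of a bit string, reduced modulo 2 (assumed target format)."""
    target, running = [], 0
    for b in bits:
        running = (running + b) % 2
        target.append(running)
    return target


def maze_size(nodes_per_side):
    """Maze cells per side: n nodes plus the n - 1 edges between them."""
    return 2 * nodes_per_side - 1


def maze_image_side(size, cell_pixels=2, border=3):
    """Pixels per side: each cell is cell_pixels wide, plus a border on both sides."""
    return size * cell_pixels + 2 * border


def chess_move_label(start, end):
    """1 x 8 x 8 label marking the start and end squares of the move.

    Squares are (row, column) pairs in 0..7; this indexing is an assumption.
    """
    grid = [[0] * 8 for _ in range(8)]
    for row, col in (start, end):
        grid[row][col] = 1
    return [grid]


if __name__ == "__main__":
    print(prefix_sums_mod2([1, 0, 1, 1]))
    print(maze_size(3), maze_image_side(maze_size(3)))
    print(sum(map(sum, chess_move_label((6, 4), (4, 4))[0])))
```

With the defaults taken from the Figure 2 caption, a $3 \times 3$ grid graph gives a $5 \times 5$ maze and a 16-pixel image side, matching the caption's $5\times 2 + 3 + 3 = 16$.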