Datasets for Studying Generalization from Easy to Hard Examples
Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Arpit Bansal, Zeyad Emam, Furong Huang, Micah Goldblum, Tom Goldstein
TL;DR
The paper introduces three benchmarks (Prefix Sums, Mazes, and Chess Puzzles) for studying how models generalize from easy to harder instances of reasoning tasks. It describes data-generation pipelines whose difficulty can be raised progressively (binary prefix sums, spanning-tree mazes, and Lichess puzzle moves encoded as pixel maps) and discusses the resulting input–label formats. A companion Python package supports access, generation, and visualization, lowering the barrier to adopting these benchmarks. Together, these resources offer a practical, scalable way to probe extrapolative generalization beyond standard IID settings.
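To make the easy-to-hard setup concrete, the following is a minimal sketch of how a binary prefix-sum example might be generated. It assumes the label at each position is the running sum of the input bits modulo 2 and that difficulty is controlled by the bit-string length. The function names are hypothetical and are not the companion package's API.

```python
import random


def prefix_sums_mod2(bits):
    """Return the running sum of `bits` modulo 2 at every position (assumed label format)."""
    labels = []
    running = 0
    for b in bits:
        running = (running + b) % 2
        labels.append(running)
    return labels


def make_prefix_sum_example(n_bits, rng):
    """Sample a random bit string of length `n_bits` and its per-position labels."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    return bits, prefix_sums_mod2(bits)


rng = random.Random(0)
easy_bits, easy_labels = make_prefix_sum_example(8, rng)    # shorter string: "easy"
hard_bits, hard_labels = make_prefix_sum_example(32, rng)   # longer string: "hard"
```

Under these assumptions, a model would be trained on short strings and evaluated on longer ones, so only the `n_bits` argument changes between the training and test distributions.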
Abstract
We describe new datasets for studying generalization from easy to hard examples.
