LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization

Massinissa Merouani, Afif Boudaoud, Riyadh Baghdadi

TL;DR

LOOPerSet tackles the data bottleneck in data-driven polyhedral compiler optimization by introducing a large-scale, public dataset of 28 million labeled examples derived from about 220,000 synthetic polyhedral programs. It is built with a three-stage pipeline (synthetic program generation with diverse loop structures, relevance-guided transformation sampling, and ground-truth performance labeling on hardware) that produces semantically valid program-schedule pairs. A diversity analysis shows broad coverage of program structures and no direct replication of existing benchmarks, and the dataset is designed to support pre-training and transfer learning for hardware portability. By providing accessible tooling and a permissive license, LOOPerSet aims to accelerate reproducible research and the development of advanced learned optimizers and cost models for polyhedral compilation.

Abstract

The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research in learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism) to a ground-truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.
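
To make the shape of a data point concrete, the following is a minimal Python sketch of a (program, schedule, execution time) record and of the speedup label a cost model might be trained on. The class name `DataPoint`, its field names, and the timing values are hypothetical and chosen for illustration only; they are not the dataset's actual schema or measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DataPoint:
    """Hypothetical record: one program, one schedule, one measured time."""
    program_id: str
    schedule: Tuple[str, ...]   # ordered transformation names; () means untransformed
    exec_time_s: float          # measured execution time in seconds


def speedup(baseline: DataPoint, candidate: DataPoint) -> float:
    """Speedup of a scheduled variant relative to the untransformed program."""
    if baseline.program_id != candidate.program_id:
        raise ValueError("speedup is only defined within one program")
    return baseline.exec_time_s / candidate.exec_time_s


def best_schedule(points: Iterable[DataPoint]) -> DataPoint:
    """Return the data point with the lowest measured execution time."""
    return min(points, key=lambda p: p.exec_time_s)


# Illustrative values only (not taken from the dataset).
baseline = DataPoint("prog_0", (), 2.0)
tiled = DataPoint("prog_0", ("tiling", "parallelism"), 0.5)
skewed = DataPoint("prog_0", ("skewing",), 2.5)
```

Under these assumptions, `speedup(baseline, tiled)` is the kind of regression target a learned cost model would predict, and a speedup below 1 (as for `skewed`) marks a schedule that slows the program down.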

Paper Structure

This paper contains 25 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: The end-to-end data generation pipeline.
  • Figure 2: Distributions of key structural characteristics of the 220k synthetic programs in LOOPerSet. The plots show the diversity in program size, depth, memory access patterns, and iteration domain shapes.
  • Figure 3: Analysis of program workload and resource consumption. These plots illustrate the dynamic range covered by the dataset in terms of memory footprint, baseline execution time, computational complexity, and low-level operation and type mix.
  • Figure 4: Distribution of the transformation space exploration and the resulting performance impact. The plots show the number of legal schedules explored per program and the wide distribution of measured speedups, which provides the necessary signal for training a performance model.
  • Figure 5: This figure compares the minimum normalized Tree Edit Distance from each PolyBench benchmark to our 220,000-program synthetic dataset (purple) and to the other benchmarks within the PolyBench suite (green). The minimum distance to a synthetic program is never zero, confirming the absence of direct replication.
  • ...and 2 more figures
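
The comparison described in the Figure 5 caption can be sketched as follows: compute a tree edit distance between a benchmark's program tree and every synthetic program tree, normalize it, and keep the minimum. This is a minimal illustration, not the paper's implementation. The tree encoding as `(label, children)` tuples, the unit edit costs, and the normalization by the combined node count are all assumptions, and the memoized recursion is only practical for small trees (an algorithm such as Zhang-Shasha would be needed at scale).

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.


def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Ordered forest edit distance with unit insert/delete/relabel costs."""
    if not f:
        return size(g)          # insert everything in g
    if not g:
        return size(f)          # delete everything in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(f[:-1], g[:-1])
             + forest_distance(v_children, w_children)
             + (v_label != w_label))
    return min(delete, insert, match)


def normalized_ted(t1, t2):
    """Edit distance divided by the combined node count, giving a value in [0, 1]."""
    return forest_distance((t1,), (t2,)) / (size((t1,)) + size((t2,)))


def min_distance_to_corpus(benchmark, corpus):
    """Smallest normalized distance from one benchmark tree to any corpus tree."""
    return min(normalized_ted(benchmark, program) for program in corpus)


# Toy loop-nest trees (hypothetical, for illustration only).
two_deep = ("for", (("for", (("stmt", ()),)),))
one_deep = ("for", (("stmt", ()),))
```

A minimum distance of zero would mean some corpus program has exactly the same tree as the benchmark, which is the case the caption reports never occurs.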