CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization
Jacob O. Tørring, Carl Hvarfner, Luigi Nardi, Magnus Själander
TL;DR
CATBench addresses the lack of standardized benchmarks for Bayesian optimization in compiler autotuning by introducing a comprehensive, containerized benchmarking suite extended from BaCO. It combines real-world tasks (TACO and RISE/ELEVATE) with multi-fidelity and multi-objective evaluation, plus a scalable client-server interface using gRPC and Docker for reproducible experimentation. Empirical results show that BaCO outperforms non-BO baselines on single-objective TACO benchmarks and highlight a robust Pareto-front structure for RISE/ELEVATE, while revealing hardware-induced variability and informative feature-importance patterns. The suite enables reproducible benchmarking, transfer learning studies, and rapid algorithm development, with clear pathways for expansion and community-driven leaderboards.
Abstract
Bayesian optimization is a powerful method for automating the tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures the complexities of compiler autotuning, ranging from discrete, conditional, and permutation parameter types to known and unknown binary constraints, as well as both multi-fidelity and multi-objective evaluations. The benchmarks in CATBench span a range of machine learning-oriented computations, from tensor algebra to image processing and clustering, and use state-of-the-art compilers, such as TACO and RISE/ELEVATE. CATBench offers a unified interface for evaluating Bayesian optimization algorithms, promoting reproducibility and innovation through an easy-to-use, fully containerized setup of both surrogate and real-world compiler optimization tasks. We validate CATBench on several state-of-the-art algorithms, revealing their strengths and weaknesses and demonstrating the suite's potential for advancing both Bayesian optimization and compiler autotuning research.
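
To make the structural challenges named above concrete, the following is a minimal Python sketch of a toy search space with a discrete, a categorical, a conditional, and a permutation parameter, together with one known and one unknown (hidden) binary constraint. It is not the CATBench interface: every name in it (`sample`, `known_constraint`, `evaluate`, `random_search`, and the parameters `tile`, `schedule`, `chunk`, `loop_order`) is hypothetical, the cost function is a made-up stand-in for compiling and timing a kernel, and plain random search stands in for a real optimizer.

```python
import random
from itertools import permutations

# Hypothetical toy search space; names and values are illustrative only.
TILE_SIZES = [4, 8, 16, 32, 64]              # discrete (ordinal) parameter
SCHEDULES = ["static", "dynamic"]            # categorical parameter
CHUNK_SIZES = [1, 2, 4, 8]                   # conditional parameter values
LOOP_ORDERS = list(permutations(range(3)))   # permutation parameter


def sample(rng):
    """Draw one configuration uniformly at random."""
    cfg = {
        "tile": rng.choice(TILE_SIZES),
        "schedule": rng.choice(SCHEDULES),
        "loop_order": rng.choice(LOOP_ORDERS),
    }
    # Conditional parameter: "chunk" only exists for the dynamic schedule.
    if cfg["schedule"] == "dynamic":
        cfg["chunk"] = rng.choice(CHUNK_SIZES)
    return cfg


def known_constraint(cfg):
    """Known constraint: checkable before evaluation (chunk must fit in a tile)."""
    return cfg.get("chunk", 1) <= cfg["tile"]


def evaluate(cfg):
    """Toy stand-in for compile-and-run; returns None on a hidden failure."""
    # Unknown constraint: the optimizer only discovers it by evaluating.
    if cfg["tile"] == 64 and cfg["loop_order"][0] == 2:
        return None
    cost = 100.0 / cfg["tile"] + 2.0 * cfg["loop_order"].index(0)
    if cfg["schedule"] == "dynamic":
        cost += 0.5 * cfg["chunk"]
    return cost


def random_search(budget, seed=0):
    """Random-search baseline: returns the best feasible (config, cost) found."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    evaluated = 0
    while evaluated < budget:
        cfg = sample(rng)
        if not known_constraint(cfg):
            continue  # rejected up front, so no evaluation budget is spent
        evaluated += 1
        cost = evaluate(cfg)
        if cost is None:
            continue  # hidden-constraint violation still consumes budget
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


if __name__ == "__main__":
    print(random_search(budget=50))
```

The sketch separates the two constraint types by cost: a known constraint filters candidates before any budget is spent, whereas a hidden constraint is only observed as a failed evaluation. A Bayesian optimizer would replace the uniform `sample` call with a model-guided proposal, but would face the same mixed parameter types and the same two kinds of infeasibility.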
