A Detailed Historical and Statistical Analysis of the Influence of Hardware Artifacts on SPEC Integer Benchmark Performance
Yueyao Wang, Samuel Furman, Nicolas Hardy, Margaret Ellis, Godmar Back, Yili Hong, Kirk Cameron
TL;DR
The paper analyzes how hardware artifacts have shaped SPEC CPU base integer speed since 1995, emphasizing normalization across SPEC generations and sensitivity analyses that separate the effects of individual system factors. It compares constant and regression-based normalization, finds the constant method preferable for cross-year comparisons, and demonstrates a strong link between individual microbenchmarks (notably gcc and perl) and overall performance after normalization. A focused investigation of libquantum reveals its outsized influence in SPEC 2006, which helps justify its removal in SPEC 2017 to stabilize scoring. The study also develops a predictive framework that uses nonlinear regression for mean trends, Gaussian process modeling of the residuals for predictions on individual configurations, and quantile regression to explore future hardware scenarios. The framework offers probabilistic forecasts and highlights how cores, caches, and parallelism will influence future performance relative to Moore's Law.
Abstract
The Standard Performance Evaluation Corporation (SPEC) CPU benchmark has been widely used as a measure of computing performance for decades. SPEC CPU is an industry-standardized, CPU-intensive benchmark suite, and the collective data provide a proxy for the history of worldwide CPU and system performance. Past efforts have not provided or enabled answers to questions such as: How has the SPEC benchmark suite evolved empirically over time, and which micro-architecture artifacts have had the most influence on performance? Have any micro-benchmarks within the suite had undue influence on the results and on comparisons among the codes? Can the answers to these questions provide insights into the future of computer system performance? To answer these questions, we detail our historical and statistical analysis of the effect of specific hardware artifacts (clock frequencies, core counts, etc.) on the performance of the SPEC benchmarks since 1995. We discuss in detail several methods to normalize across benchmark evolutions. We perform both isolated and collective sensitivity analyses for various hardware artifacts, and we identify one benchmark (libquantum) that had a somewhat undue influence on performance outcomes. We also present the use of SPEC data to predict future performance.
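To make the notion of "undue influence" concrete, the following minimal sketch shows how one benchmark can move a composite score. It relies on the fact that SPEC CPU composite metrics are geometric means of per-benchmark ratios. The function names and all ratio values are hypothetical illustrations, not data or methods from the paper, and the leave-one-out comparison is a simple stand-in rather than the sensitivity analysis the authors performed.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive per-benchmark speed ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def influence(ratios_by_name, name):
    """Factor by which the composite changes when `name` is included.

    Hypothetical helper: compares the geometric mean over all benchmarks
    with the geometric mean over all benchmarks except `name`.
    """
    with_all = geometric_mean(list(ratios_by_name.values()))
    without = geometric_mean(
        [r for key, r in ratios_by_name.items() if key != name]
    )
    return with_all / without


# Hypothetical ratios chosen for easy arithmetic, NOT measured SPEC results.
example = {"gcc": 20.0, "perlbench": 20.0, "mcf": 20.0, "libquantum": 320.0}

if __name__ == "__main__":
    print(geometric_mean(list(example.values())))
    print(influence(example, "libquantum"))
```

In this made-up case, three benchmarks have a ratio of 20 and one has a ratio of 320, so the composite is about 40 with the outlier and 20 without it: a single code doubles the reported score even though the other results are unchanged.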
