
Numerical Superoptimization for Library Learning

Jonas Regehr, Mitch Briles, Zachary Tatlock, Pavel Panchekha

Abstract

Numerical software depends on fast, accurate implementations of mathematical primitives like sin, exp, and log. Modern superoptimizers can optimize floating-point kernels against a given set of such primitives, but a more fundamental question remains open: which new primitives are worth implementing in the first place? We formulate this as numerical library learning: given a workload of floating-point kernels, identify the mathematical primitives whose expert implementations would most improve speed and accuracy. Our key insight is that numerical superoptimizers already have machinery well-suited to this problem. Their search procedures happen to enumerate candidate primitives, their equivalence procedures can generalize and deduplicate candidates, and their cost models can estimate counterfactual utility: how much the workload would improve if a given primitive were available. We present GrowLibm, which repurposes the Herbie superoptimizer as a numerical library learner. GrowLibm mines candidate primitives from the superoptimizer's intermediate search results, ranks them by counterfactual utility, and prunes redundant candidates. Across three scientific applications (PROJ, CoolProp, and Basilisk), GrowLibm identifies compact, reusable primitives that can be implemented effectively using standard numerical techniques. When Herbie is extended with these expert implementations, kernel speed improves by up to 2.2x at fixed accuracy, and maximum achievable accuracy also improves, in one case from 56.0% to 93.5%. We also prototype an LLVM matcher that recognizes learned primitives in optimized IR, recovering 26 replacement sites across five PROJ projections and improving end-to-end application performance by up to 5%.



Figures (11)

  • Figure 1: The GrowLibm pipeline. Given numerical kernels converted to FPCore [fpbench], a standard floating-point interchange format, the generation phase runs Herbie to explore equivalent programs, extracts subexpressions from Herbie's intermediates, and canonicalizes and deduplicates them. The selection phase iteratively ranks candidates by frequency, urgency, and size; resolves implications between top candidates so that redundant variants are not both selected; and adds the winners to Herbie's platform for re-evaluation. A final Herbie pass confirms which candidates are actually used in optimized kernels, producing the proposed primitives.
  • Figure 2: The forward ellipsoidal projection function for the Swiss Oblique Mercator projection from the PROJ library (somerc.cpp). Note the complex compositions of mathematical functions, including $2 \operatorname{atan}(\exp(x)) - \pi/2$, $\log(\tan(\pi/4 + 0.5 x))$, and $\log((1 + x) / (1 - x))$. Accurately evaluating these compositions is challenging in floating point and generally beyond the capabilities of existing numerical superoptimizers.
  • Figure 3: A numerical expert's implementation of $\log((1 + x) / (1 - x))$, using a polynomial that is accurate for inputs $|x| < 0.1716$. This implementation borrows from the widely used fdlibm math library. The input range is sufficient for the use case in Figure 2 because there the relevant input $x$ is the eccentricity of the Earth times the sine of the oblique angle, well within that input range.
  • Figure 4: Other lines of code in PROJ that compute the log1pmd expression. Note that some of these uses have the $1 + x$ term in the numerator of the division, but others have it in the denominator. This is common in numerical code and means that finding common functions requires reasoning about algebraic equivalence and composition.
  • Figure 5: Distributions of the size, frequency, and urgency heuristics for a PROJ run. Size and frequency are well-distributed, providing good discriminative signal. Urgency is concentrated near zero with a long tail of high-urgency candidates, confirming the bimodal structure that justifies two-stage filtering.
  • ...and 6 more figures
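The technique behind Figure 3 can be illustrated with a short sketch. The code below is not the paper's implementation: the coefficients are plain Taylor-series terms for $\log((1+x)/(1-x)) = 2\operatorname{atanh}(x)$ rather than fdlibm's tuned minimax coefficients, and the name `log1pmd` simply follows the expression name used in Figure 4, so accuracy and naming here are illustrative assumptions only.

```c
/* Hedged sketch of a log1pmd-style primitive:
 *   log((1+x)/(1-x)) = 2*atanh(x)
 * approximated on |x| < 0.1716 by the odd Taylor series
 *   2x + 2x^3/3 + 2x^5/5 + ...
 * These are raw Taylor coefficients, not fdlibm's minimax
 * coefficients, so this is illustrative rather than expert-grade. */
double log1pmd(double x) {
    double s = x * x;          /* series variable: s = x^2 */
    double p = 2.0 / 13.0;     /* Horner evaluation of the series in s */
    p = p * s + 2.0 / 11.0;
    p = p * s + 2.0 / 9.0;
    p = p * s + 2.0 / 7.0;
    p = p * s + 2.0 / 5.0;
    p = p * s + 2.0 / 3.0;
    /* 2x carries the leading term; the correction x*s*p is small,
     * so no catastrophic cancellation occurs on this input range. */
    return 2.0 * x + x * s * p;
}
```

Note that the identity $\log((1-x)/(1+x)) = -\log((1+x)/(1-x))$ covers the flipped uses Figure 4 mentions, which is one reason canonicalization up to algebraic equivalence can merge those call sites into a single candidate primitive.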