Compilation of Generalized Matrix Chains with Symbolic Sizes
Francisco López, Lars Karlsson, Paolo Bientinesi
TL;DR
This paper tackles the problem of efficiently evaluating Generalized Matrix Chains (GMCs) when matrix sizes are symbolic, i.e., unknown at compile time. It introduces a multi-versioning code generator that emits a small set of variants together with a run-time dispatcher that selects the best variant for the given sizes, backed by theoretical results guaranteeing that the cost stays within a constant factor of the optimum. The method also includes an empirical expansion procedure that adds variants when needed, balancing code size against performance. Experiments show substantial improvements over single-variant approaches and competitive performance against Armadillo, with FLOP overheads kept under 15% in most cases and significant speedups in execution time when the expanded variant sets are used.
Abstract
Generalized Matrix Chains (GMCs) are products of matrices where each matrix carries features (e.g., general, symmetric, triangular, positive-definite) and is optionally transposed and/or inverted. GMCs are commonly evaluated via sequences of calls to BLAS and LAPACK kernels. When matrix sizes are known, one can craft a sequence of kernel calls that evaluates a GMC while minimizing some cost, e.g., the number of floating-point operations (FLOPs). Even in these circumstances, the high-level languages and libraries on which users usually rely typically map the input GMC onto a suboptimal sequence of kernels. In this work, we go one step further and consider matrix sizes to be symbolic (unknown); this changes the nature of the problem, since no single sequence of kernel calls is optimal for all possible combinations of matrix sizes. We design and evaluate a code generator for GMCs with symbolic sizes that relies on multi-versioning. At compile-time, when the GMC is known but the sizes are not, code is generated for a few carefully selected sequences of kernel calls. At run-time, when the sizes become known, the best generated variant for the matrix sizes at hand is selected and executed. The code generator uses new theoretical results that guarantee that the cost is within a constant factor of optimal for all matrix sizes, and an empirical tuning component that further tightens the gap to optimality in practice. In experiments, we found that the increase above optimal in both FLOPs and execution time of the generated code was less than 15% for 95% of the tested chains.
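To make the multi-versioning idea concrete, the following is a minimal Python sketch for the simplest possible case: a chain of three general matrices with no transposition, inversion, or special features. It is not the paper's code generator. All names (`VARIANTS`, `select_variant`, `evaluate_chain`, and so on) are hypothetical, the cost model assumes that multiplying an m x k matrix by a k x n matrix costs 2mkn FLOPs, and the variant set is simply both parenthesizations rather than a set chosen by the paper's selection procedure. A naive list-based product stands in for a BLAS kernel call.

```python
# Illustrative sketch only: chain A * B * C with A: m x k, B: k x n, C: n x p.
# Assumed cost model: an (m x k) by (k x n) general product costs 2*m*k*n FLOPs.

def matmul(X, Y):
    """Naive dense product on lists of rows (stand-in for a GEMM call)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def flops_left(m, k, n, p):
    """FLOPs of (A * B) * C under the assumed cost model."""
    return 2 * m * k * n + 2 * m * n * p

def flops_right(m, k, n, p):
    """FLOPs of A * (B * C) under the assumed cost model."""
    return 2 * k * n * p + 2 * m * k * p

def eval_left(A, B, C):
    return matmul(matmul(A, B), C)

def eval_right(A, B, C):
    return matmul(A, matmul(B, C))

# "Compile time": the chain is known, the sizes are not. Each variant is
# stored together with its cost as a function of the symbolic sizes.
VARIANTS = [
    ("left", flops_left, eval_left),
    ("right", flops_right, eval_right),
]

def select_variant(m, k, n, p):
    """Run-time dispatcher: pick the variant with the lowest predicted FLOPs.

    Ties are resolved in favor of the first variant in VARIANTS.
    """
    return min(VARIANTS, key=lambda v: v[1](m, k, n, p))

def evaluate_chain(A, B, C):
    """Read the now-known sizes, dispatch, and execute the chosen variant."""
    m, k = len(A), len(A[0])
    n, p = len(B[0]), len(C[0])
    name, _, run = select_variant(m, k, n, p)
    return name, run(A, B, C)
```

With sizes (m, k, n, p) = (1000, 10, 1000, 10) the right-to-left variant needs 4e5 FLOPs against 4e7 for left-to-right, while for (10, 1000, 10, 1000) the situation is reversed. This is the sense in which no single sequence of kernel calls is optimal for all sizes, and why the dispatch is deferred to run-time.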
