Fast in-place accumulation

Jean-Guillaume Dumas; Bruno Grenet

TL;DR

The paper addresses the longstanding space-time trade-off in fast algebraic algorithms by introducing a reversible in-place accumulation model, in which inputs may be temporarily modified provided they are restored afterwards. It gives a general transformation that converts any bilinear algorithm (and, more generally, any linear accumulation) into an in-place accumulating variant, which yields fast Strassen-like matrix multiplication and fast polynomial multiplication with only $O(1)$ extra space. The framework produces in-place algorithms for a wide range of linear-algebra subroutines, Toeplitz/circulant operations, convolutions, and modular remainder/multiplication, often matching the asymptotic complexity of their not-in-place counterparts. The design is automatic, extends to over-place variants, and applies to polynomial extensions of finite fields; whether the logarithmic overhead can be removed in some regimes is left as an open question.

Abstract

This paper deals with simultaneously fast and in-place algorithms for formulae where the result has to be linearly accumulated: some output variables are also input variables, linked by a linear dependency. Fundamental examples include the in-place accumulated multiplication of polynomials or matrices (that is, with only $O(1)$ extra space). The difficulty is to combine in-place computations with fast algorithms: those usually come at the expense of (potentially large) extra temporary space. We first propose a novel automatic design of fast and in-place accumulating algorithms for any bilinear formulae (and thus for polynomial and matrix multiplication) and then extend it to any linear accumulation of a collection of functions. For this, we relax the in-place model to any algorithm allowed to modify its inputs, provided that those are restored to their initial state afterwards. This allows us to derive in-place accumulating algorithms for fast polynomial multiplications and for Strassen-like matrix multiplications. We then consider the simultaneously fast and in-place computation of the Euclidean polynomial remainder $R = A \bmod B$. If $A$ and $B$ have respective degree $m+n$ and $n$, and $M(k)$ denotes the complexity of a (not-in-place) algorithm to multiply two degree-$k$ polynomials, our algorithm uses at most $O((n/m) M(m)\log(m))$ arithmetic operations. If $M(n) = \Theta(n^{1+\epsilon})$ for some $\epsilon>0$, then our algorithms match the not-in-place complexity bound of $O((n/m) M(m))$. We also propose variants that compute, still in-place and with the same complexity bounds, $A = A \bmod B$, $R += A \bmod B$ and $R += AC \bmod B$, the last of which is multiplication in a finite field extension. To achieve this, we develop techniques for Toeplitz matrix operations, for generalized convolutions, short product and power series division and remainder whose output is also part of the input.
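The relaxed model described above (inputs may be modified as long as they are restored) can be illustrated with a minimal Python sketch. It is a reconstruction under stated assumptions, not the paper's algorithm verbatim: the function name, the helper names, and the encoding of a bilinear formula as three coefficient matrices `alpha`, `beta`, `mu` (with `c += mu . ((alpha . a) * (beta . b))`) are all hypothetical choices made for illustration. It also assumes exact arithmetic over a field, so that every non-zero coefficient is invertible, and that `a`, `b` and `c` are distinct lists.

```python
from fractions import Fraction

def _combine(v, row, pivot):
    """Overwrite v[pivot] with sum_x row[x] * v[x], using no extra storage."""
    v[pivot] *= row[pivot]
    for x, coeff in enumerate(row):
        if x != pivot and coeff != 0:
            v[pivot] += coeff * v[x]

def _restore(v, row, pivot):
    """Undo _combine exactly, so the input is back in its initial state."""
    for x, coeff in enumerate(row):
        if x != pivot and coeff != 0:
            v[pivot] -= coeff * v[x]
    v[pivot] /= Fraction(row[pivot])

def bilinear_accumulate_in_place(a, b, c, alpha, beta, mu):
    """Hypothetical sketch: c += mu . ((alpha . a) * (beta . b)).

    Only the cells of a, b and c are written; a and b are restored.
    """
    for l in range(len(alpha)):
        i = next(x for x, v in enumerate(alpha[l]) if v != 0)
        j = next(x for x, v in enumerate(beta[l]) if v != 0)
        col = [mu[r][l] for r in range(len(c))]
        k = next(r for r, v in enumerate(col) if v != 0)

        _combine(a, alpha[l], i)   # a[i] now holds the l-th left operand
        _combine(b, beta[l], j)    # b[j] now holds the l-th right operand

        # Pre-correct the outputs so that one accumulation into c[k]
        # can then be propagated to every other output that needs it.
        c[k] /= Fraction(col[k])
        for r, coeff in enumerate(col):
            if r != k and coeff != 0:
                c[r] -= coeff * c[k]
        c[k] += a[i] * b[j]        # the single multiplication of step l
        for r, coeff in enumerate(col):
            if r != k and coeff != 0:
                c[r] += coeff * c[k]
        c[k] *= col[k]

        _restore(b, beta[l], j)
        _restore(a, alpha[l], i)

# Karatsuba for two degree-1 polynomials, as one possible encoding:
# products a0*b0, a1*b1, (a0+a1)*(b0+b1).
KARATSUBA_ALPHA = [[1, 0], [0, 1], [1, 1]]
KARATSUBA_BETA = [[1, 0], [0, 1], [1, 1]]
KARATSUBA_MU = [[1, 0, 0], [-1, -1, 1], [0, 1, 0]]

a, b, c = [3, 5], [2, 7], [10, 20, 30]
bilinear_accumulate_in_place(a, b, c, KARATSUBA_ALPHA, KARATSUBA_BETA, KARATSUBA_MU)
assert c == [16, 51, 65]              # (3 + 5x)(2 + 7x) = 6 + 31x + 35x^2, added to c
assert a == [3, 5] and b == [2, 7]    # inputs are back in their initial state
```

The sketch covers a single, non-recursive level: each product `a[i] * b[j]` is a scalar multiplication here, whereas a fast algorithm would replace it with a recursive accumulating call on blocks.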

Paper Structure (46 sections, 27 theorems, 31 equations, 3 figures, 5 tables, 33 algorithms)

Introduction
Computational model
Notations
Classical algorithms
In-place linear accumulation
In-place Strassen matrix multiplication with accumulation
In-place accumulating matrix multiplication with 7 recursive calls and 18 additions
In-place additive complexity
Fast over-place linear algebra
Over-place TRMM and TRSM
Over-place PLUQ
Over-place KERN, INVT and INV
Fast in-place square and rank-k update
In-place SQUARE
In-place SYRK
...and 31 more sections

Key Result

Theorem 1

Algorithm alg:bilin is correct, in-place, and requires $t$ MUL, $2(\#\alpha+\#\beta+\#\mu)-5t$ ADD and $2(\sharp\alpha+\sharp\beta+\sharp\mu)$ SCA operations.
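
As a hedged sanity check of these counts, consider Karatsuba's formula for two degree-1 polynomials, written as a bilinear formula with $t=3$ products. This relies on two assumptions that this excerpt does not state: that $\#x$ counts the non-zero entries of $x$, and that $\sharp x$ counts the entries of $x$ outside $\{0,1,-1\}$. The matrices are one standard encoding chosen for illustration, not taken from the paper.

$$\alpha=\beta=\begin{pmatrix}1&0\\0&1\\1&1\end{pmatrix},\qquad \mu=\begin{pmatrix}1&0&0\\-1&-1&1\\0&1&0\end{pmatrix}$$

$$\#\alpha=\#\beta=4,\quad \#\mu=5,\quad \sharp\alpha=\sharp\beta=\sharp\mu=0$$

$$\text{MUL}=t=3,\qquad \text{ADD}=2(4+4+5)-5\cdot 3=11,\qquad \text{SCA}=2(0+0+0)=0$$

Under these assumptions, the 11 additions split as $2(\#\alpha-t)=2$ to form and undo the left operands, $2(\#\beta-t)=2$ for the right operands, and $2\#\mu-t=7$ to accumulate the three products into the outputs.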

Figures (3)

Figure 1: Optimal In-place accumulating Strassen-Winograd MM.
Figure 2: In-place Karatsuba polynomial multiplication.
Figure 3: Main polynomial reductions in the paper.

Theorems & Definitions (70)

Remark 1
Example 1
Remark 2
Theorem 1
proof
Remark 3
Theorem 2
Lemma 1
proof
Lemma 2
...and 60 more
