A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication

Alexandre Siron, Sean Molesky

Abstract

Numeric modeling of electromagnetics and acoustics frequently entails matrix-vector multiplication with block Toeplitz structure. When the corresponding block Toeplitz matrix is not highly sparse, e.g. when considering the electromagnetic Green function in a spatial basis, such calculations are often carried out by performing a multilevel embedding that gives the matrix a fully circulant form. While this transformation allows the associated matrix-vector multiplication to be computed via Fast Fourier Transforms (FFTs) and diagonal multiplication, generally leading to dramatic performance improvements compared to naive multiplication, it also adds unnecessary information that increases memory consumption and reduces computational efficiency. As an improvement, we propose a lazy-embedding, eager-projection algorithm that, for dimensionality $d$, asymptotically reduces the number of needed computations $\propto d/ \left(2 - 2^{-d+1}\right)$ and peak memory usage $\propto 2/\left((d+1)2^{-d} + 1\right)$ in general, and $\propto\left(2^{d} + 1\right)/\left(d +2\right)$ for fully symmetric or skew-symmetric systems. The structure of the algorithm suggests several simple approaches for parallelizing large block Toeplitz matrix-vector products across multiple devices and adds flexibility in memory and task management.
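For orientation, the standard circulant embedding that the abstract takes as its baseline can be sketched in one dimension as follows. This is a minimal illustration of the conventional approach only, not the proposed split algorithm. The function names (`fft`, `ifft`, `toeplitz_matvec_direct`, `toeplitz_matvec_circulant`) and the `col`/`row` convention are assumptions made for this example, and the small radix-2 FFT stands in for an optimized library routine.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (illustrative only); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(x):
    """Inverse FFT, normalized by the length."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def toeplitz_matvec_direct(col, row, v):
    """Naive O(n^2) product with T[i][j] = col[i-j] if i >= j else row[j-i]."""
    n = len(v)
    return [
        sum((col[i - j] if i >= j else row[j - i]) * v[j] for j in range(n))
        for i in range(n)
    ]


def toeplitz_matvec_circulant(col, row, v):
    """Standard embedding: n x n Toeplitz -> 2n x 2n circulant, then FFTs.

    Assumes len(v) is a power of two and row[0] == col[0].
    """
    n = len(v)
    # First column of the circulant: col, one free entry (0), reversed row tail.
    c = list(col) + [0.0] + list(row[:0:-1])
    # The input vector is zero padded to length 2n (the "dilution" of data).
    padded = list(v) + [0.0] * n
    spectrum = [a * b for a, b in zip(fft(c), fft(padded))]
    # Only the first n entries of the result belong to the Toeplitz product.
    return [p.real for p in ifft(spectrum)[:n]]


if __name__ == "__main__":
    col = [1.0, 2.0, 3.0, 4.0]
    row = [1.0, 5.0, 6.0, 7.0]
    v = [1.0, 0.0, -1.0, 2.0]
    print(toeplitz_matvec_direct(col, row, v))
    print(toeplitz_matvec_circulant(col, row, v))
```

Half of the padded vector and half of the output are discarded in this one-dimensional case. Repeating the embedding along each of $d$ levels pads the input by a factor of $2^{d}$, which is the overhead the abstract refers to as unnecessary information.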

Paper Structure

This paper contains 7 sections, 12 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Left: Translational invariance and block Toeplitz structure. The schematic displays the distribution of interaction coefficients for translationally invariant physics of a pair of simple grids. 0-step interactions (self-interactions) are labeled in brown, vertical 1-step interactions in yellow and light green, horizontal 1-step interactions in blue and pink, 2-step interactions in darker green, and 3-step interactions in purple. Right: Block Toeplitz embedding and data padding. The cartoon illustrates the dilution of input vector data in the standard circulant embedding procedure as the dimensionality (level) of block Toeplitz structure increases. Input vector data is indicated as the coloured portion of each sub-figure.
  • Figure 2: Left: Branching form of proposed algorithm. The left panel depicts the branching tree of data flow in 3D to perform block Toeplitz matrix-vector multiplication in the proposed algorithm, Alg. 1. Black dots are FFT transformations, $P$-labelled arrows represent odd (phase-modified) Fourier coefficients, and black horizontal arrows are diagonal multiplications with block Toeplitz matrix data. Right: Branch execution in recursive implementation. The schematic shows the particular parts of the data flow that are controlled by a given branch (bId) in reference to Alg. 1. Two different levels of toeMulBrn are displayed as green and purple boxes. Each level launches two function calls and then merges the returned results. Control is then returned to the calling level.
  • Figure 3: Comparison of wall clock and operational complexity ratios. The figure depicts the wall clock time ratio of the Julia implementation of Alg. 1 (https://github.com/alsirc/SplitFFT_lazyEmbed) compared to standard circulant embedding for vectors with lengths corresponding to powers of $2$. In moving from panel (a) to (d), the dimensionality of block Toeplitz structure is increased from $1$ (Toeplitz matrix) to $4$.
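As a rough numerical companion to Figure 3, the asymptotic expressions quoted in the abstract can be evaluated for the dimensionalities $d = 1$ to $4$ covered by the panels. The sketch below assumes each expression is read as the ratio of the standard circulant embedding cost to the cost of the proposed algorithm; the function names are hypothetical and chosen only for this illustration.

```python
def ops_ratio(d):
    """Asymptotic computation ratio from the abstract: d / (2 - 2^(-d+1))."""
    return d / (2 - 2 ** (-d + 1))


def mem_ratio(d):
    """Asymptotic peak memory ratio, general case: 2 / ((d+1) 2^(-d) + 1)."""
    return 2 / ((d + 1) * 2 ** (-d) + 1)


def mem_ratio_sym(d):
    """Asymptotic peak memory ratio, fully symmetric or skew-symmetric case:
    (2^d + 1) / (d + 2)."""
    return (2 ** d + 1) / (d + 2)


if __name__ == "__main__":
    print(" d   ops    mem    mem(sym)")
    for d in range(1, 5):
        print(f"{d:2d}  {ops_ratio(d):.3f}  {mem_ratio(d):.3f}  {mem_ratio_sym(d):.3f}")
```

All three expressions equal $1$ at $d = 1$ and grow with $d$, so under this reading the asymptotic advantage appears only for multilevel (block) Toeplitz structure.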

Theorems & Definitions (1)

  • proof