Table of Contents
Fetching ...

A faster multipole Legendre-Chebyshev transform

Mikael Mortensen

Abstract

This paper describes a fast algorithm for transforming Legendre coefficients into Chebyshev coefficients, and vice versa. The algorithm is based on the fast multipole method and is similar to the approach described by Alpert and Rokhlin [SIAM J. Sci. Comput., 12 (1991)]. The main difference is that we utilise a modal Galerkin approach with Chebyshev basis functions instead of a nodal approach with a Lagrange basis. Part of the algorithm is a novel method that facilitates faster spreading of intermediate results through neighbouring levels of hierarchical matrices. This enhancement leads to a method that is approximately 20% faster to execute, due to less floating point operations. We also describe an efficient initialization algorithm that for the Lagrange basis is roughly 5 times faster than the original method for large input arrays. The described method has both a planning and execution stage that asymptotically require O(N) flops. The algorithm is simple enough that it can be implemented in 100 lines of vectorized Python code. Moreover, its efficiency is such that a single-threaded C implementation can transform 1000,000 coefficients in approximately 20 milliseconds on a new MacBook Pro M3, representing about 3 times the execution time of a well-planned (single-threaded) type 2 discrete cosine transform from FFTW (www.fftw.org). Planning for the 1000,000 coefficients requires approximately 50 milliseconds.

A faster multipole Legendre-Chebyshev transform

Abstract

This paper describes a fast algorithm for transforming Legendre coefficients into Chebyshev coefficients, and vice versa. The algorithm is based on the fast multipole method and is similar to the approach described by Alpert and Rokhlin [SIAM J. Sci. Comput., 12 (1991)]. The main difference is that we utilise a modal Galerkin approach with Chebyshev basis functions instead of a nodal approach with a Lagrange basis. Part of the algorithm is a novel method that facilitates faster spreading of intermediate results through neighbouring levels of hierarchical matrices. This enhancement leads to a method that is approximately 20% faster to execute, due to less floating point operations. We also describe an efficient initialization algorithm that for the Lagrange basis is roughly 5 times faster than the original method for large input arrays. The described method has both a planning and execution stage that asymptotically require O(N) flops. The algorithm is simple enough that it can be implemented in 100 lines of vectorized Python code. Moreover, its efficiency is such that a single-threaded C implementation can transform 1000,000 coefficients in approximately 20 milliseconds on a new MacBook Pro M3, representing about 3 times the execution time of a well-planned (single-threaded) type 2 discrete cosine transform from FFTW (www.fftw.org). Planning for the 1000,000 coefficients requires approximately 50 milliseconds.

Paper Structure

This paper contains 22 sections, 2 theorems, 67 equations, 5 figures, 1 table, 5 algorithms.

Key Result

Lemma 3.1

\newlabellem:blockflops0 The significant flop count for a matrix vector product over a block is

Figures (5)

  • Figure 1: Example of a decomposition of the matrix $\boldsymbol{A}$ using three levels. Level 0 is the one with the fewest blocks, and then levels 1 and 2 have gradually more. For levels 0 and 1 the local block numbers are given in the upper left corner of each square, whereas the local indices for a submatrix on a block are shown in parenthesis. For level 2 only the local block number is printed.
  • Figure 1: The normalized maximum error for a consecutive L2C and C2L transforms. The black solid curve is using uniform random input data (r=0) and the dashed curve is using decaying, random Legendre coefficients scaled with r=1/2.
  • Figure 2: Example of a local part of the matrix $\boldsymbol{A}$, showing two neighbouring levels. There is one block on level $\gamma$, with block number $b$. The block numbers on level $\gamma+1$ are shown in upper left corners as computed from $b$. Also shown are the reference domains $X$ and $Y$ on level $\gamma$. For submatrix $(0, 0)$ on level $\gamma$, where the upper left corner is at global indices $i, j$, we also show the subintervals $I(0)$ and $I(1)$ used in Eq. \ref{['eq:wtk']}.
  • Figure 2: In (a) the black solid and dash-dotted curves show the execution times in seconds for a Legendre to Chebyshev transform of the current FMM (m) on the MacBook and SAGA computers, respectively. The loosely dotted curve shows the execution time (Mac) for the nodal FMM (n). The large black dots show the execution times of the triangular-banded divide and conquer (TDC) method from Table 4.3 of Olver2020. The execution time (Mac) for a type 2 DCT (FFTW) is shown in gray. In (b) the planning time (Mac) for the FMM (m) is shown as a black dotted line, for FMM (n) in a dash-dotted line and for TDC Olver2020 in large dots. Three dashed lines are shown with increasing slope to illustrate the asymptotic behaviour for $\mathcal{O}(N), \mathcal{O}(N (\log N)^2)$ and $\mathcal{O}(N^2)$. The asymptotic curves are identical in (a) and (b).
  • Figure 3: Speedup measured for the L2C method is shown in black solid, dotted and dashed lines for 2, 4 and 8 CPU cores, respectively. For FFTW's DCT type II the corresponding measurements are shown as squares, circles and triangles.

Theorems & Definitions (4)

  • Lemma 3.1
  • Proof 1
  • Lemma 4.1
  • Proof 2