Communication efficient application of sequences of planar rotations to a matrix

Thijs Steel, Julien Langou

TL;DR

The paper tackles the efficient application of sequences of planar rotations to a matrix, a critical subroutine in eigenvalue-related algorithms. It introduces a memory- and cache-conscious approach that combines a novel register-reuse kernel, blocking, and packing to minimize data movement, supported by analytical I/O and operation-cost assessments. The method achieves substantial speedups over state-of-the-art approaches and, on modern CPUs, can reach flop rates close to the theoretical peak, particularly with a carefully chosen kernel size (e.g., $m_r=16$, $k_r=2$) and appropriate parallelization. The work has practical impact for implicit QR-like methods and could influence future BLAS/BLIS implementations and performance-oriented linear algebra on diverse architectures.

Abstract

We present an efficient algorithm for the application of sequences of planar rotations to a matrix. Applying such sequences efficiently is important in many numerical linear algebra algorithms for eigenvalues. Our algorithm is novel in three main ways. First, we introduce a new kernel that is optimized for register reuse in a novel way. Second, we introduce a blocking and packing scheme that improves the cache efficiency of the algorithm. Finally, we thoroughly analyze the memory operations of the algorithm which leads to important theoretical insights and makes it easier to select good parameters. Numerical experiments show that our algorithm outperforms the state-of-the-art and achieves a flop rate close to the theoretical peak on modern hardware.
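
For orientation, the operation the abstract refers to can be sketched as a naive reference loop in Python. This is only an illustration of what is being computed, not the authors' kernel: the function name, the storage of the cosines and sines as $(n-1) \times k$ nested lists `C` and `S`, and the sign convention of each rotation are assumptions made for this sketch.

```python
def apply_rotation_sequences(A, C, S):
    """Apply k sequences of n-1 planar rotations to the columns of A, in place.

    A is an m x n matrix stored as a list of rows.
    C[i][p] and S[i][p] hold the cosine and sine of rotation i in sequence p
    (assumed layout). Rotation (i, p) mixes columns i and i+1 of A.
    """
    m = len(A)
    n = len(A[0]) if m else 0
    k = len(C[0]) if C else 0
    for p in range(k):              # one full sequence after another
        for i in range(n - 1):      # rotations within a sequence, left to right
            c, s = C[i][p], S[i][p]
            for r in range(m):      # every row of A sees the same rotation
                x, y = A[r][i], A[r][i + 1]
                A[r][i] = c * x + s * y        # assumed sign convention
                A[r][i + 1] = -s * x + c * y
    return A


# Usage: a single rotation with c = 0, s = 1 applied to a 1 x 2 matrix.
A = [[1.0, 2.0]]
apply_rotation_sequences(A, [[0.0]], [[1.0]])
```

Each rotation costs a handful of flops per row but touches two full columns of the matrix, so a loop like this one is dominated by memory traffic. That is the inefficiency the kernel, blocking, and packing described in the abstract are meant to address.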

Paper Structure

This paper contains 24 sections, 11 equations, 8 figures, 4 algorithms.

Figures (8)

  • Figure 1: The matrix $C$ containing the cosines of the rotations and arrows indicating the order in which the rotations are applied. On the left, the standard pattern which applies full sequences of rotations. On the right, the wavefront pattern which applies the rotations in "waves".
  • Figure 1: Illustration of packing for the matrix $A$. The matrix on the left is stored in column-major order, the matrix on the right is stored in packed order.
  • Figure 1: Illustration of the blocking scheme. On the left, the matrix $A$ to whose columns the rotations are applied, and on the right, the matrix $C$ containing the cosines of the rotations. We do not show the matrix $S$ here because its blocks are identical to those of $C$. One of the blocks is indicated with diagonal lines. Notice how that block covers two of the rectangles in $A$: the blocks in $A$ overlap.
  • Figure 1: On top, the flop rates of the different algorithms. On the bottom, the runtime of the different algorithms relative to rs_kernel_v2.
  • Figure 2: Illustration of the application of a block of rotations using the kernel. On the left, the matrix $A$ to whose columns the rotations are applied, and on the right, the matrix $C$ containing the cosines of the rotations. We do not show the matrix $S$ here because its blocks are identical to those of $C$.
  • ...and 3 more figures
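
The two application orders contrasted in the first caption can be made concrete with a short Python sketch. It is a hypothetical illustration rather than the paper's code: the helper names, the $(n-1) \times k$ layout of `C` and `S`, the rotation sign convention, and the choice of anti-diagonals $i + p = w$ as the "waves" are all assumptions. The sketch relies on the fact that rotation $(i, p)$ only needs rotations $(i-1, p)$, $(i, p-1)$ and $(i+1, p-1)$ to have been applied before it, so any order respecting those dependencies gives the same result.

```python
import math


def standard_order(n, k):
    """Yield (i, p) pairs: one full sequence of n-1 rotations after another."""
    for p in range(k):
        for i in range(n - 1):
            yield i, p


def wavefront_order(n, k):
    """Yield (i, p) pairs along anti-diagonals i + p = w (assumed wave shape)."""
    for w in range((n - 1) + (k - 1)):
        for p in range(k):
            i = w - p
            if 0 <= i < n - 1:
                yield i, p


def apply_in_order(A, C, S, order):
    """Apply rotation (i, p) to columns i, i+1 of A for each pair in `order`."""
    for i, p in order:
        c, s = C[i][p], S[i][p]
        for row in A:
            x, y = row[i], row[i + 1]
            row[i] = c * x + s * y
            row[i + 1] = -s * x + c * y
    return A


# Usage: both orders produce the same matrix on a small example.
n, k = 5, 3
C = [[math.cos(0.1 * (i + 1) + 0.3 * p) for p in range(k)] for i in range(n - 1)]
S = [[math.sin(0.1 * (i + 1) + 0.3 * p) for p in range(k)] for i in range(n - 1)]
A_std = [[float(r * n + j) for j in range(n)] for r in range(2)]
A_wave = [row[:] for row in A_std]
apply_in_order(A_std, C, S, standard_order(n, k))
apply_in_order(A_wave, C, S, wavefront_order(n, k))
```

In the wavefront order, consecutive rotations act on neighbouring columns, so a small window of columns is reused for several rotations before the sweep moves on. The standard order instead streams across all columns once per sequence.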