Performance of linear solvers in tensor-train format on current multicore architectures

Melven Röhrig-Zöllner; Manuel Joey Becklas; Jonas Thies; Achim Basermann

TL;DR

This paper discusses the performance of solvers for low-rank linear systems in the tensor-train format (also known as matrix-product states), and proposes a generic preconditioner based on a TT-rank-1 approximation of the linear operator.

Abstract

Tensor networks are a class of algorithms aimed at reducing the computational complexity of high-dimensional problems. They are used in an increasing number of applications, from quantum simulations to machine learning. Exploiting data parallelism in these algorithms is key to using modern hardware. However, there are several ways to map required tensor operations onto linear algebra routines ("building blocks"). Optimizing this mapping impacts the numerical behavior, so computational and numerical aspects must be considered hand-in-hand. In this paper we discuss the performance of solvers for low-rank linear systems in the tensor-train format (also known as matrix-product states). We consider three popular algorithms: TT-GMRES, MALS, and AMEn. We illustrate their computational complexity based on the example of discretizing a simple high-dimensional PDE in, e.g., $50^{10}$ grid points. This shows that the projection to smaller sub-problems for MALS and AMEn reduces the number of floating-point operations by orders of magnitude. We suggest optimizations regarding orthogonalization steps, singular value decompositions, and tensor contractions. In addition, we propose a generic preconditioner based on a TT-rank-1 approximation of the linear operator. Overall, we obtain roughly a 5x speedup over the reference algorithm for the fastest method (AMEn) on a current multicore CPU.
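
To make the storage argument behind the tensor-train format concrete, the following minimal sketch builds a TT representation of a simple illustrative tensor and compares its storage with the dense size for $50^{10}$ grid points. The tensor $T[i_1,\dots,i_d] = i_1 + \dots + i_d$ and all function names (`sum_tensor_cores`, `tt_entry`, `tt_storage`) are assumptions chosen for illustration; they do not come from the paper or its code.

```python
def sum_tensor_cores(n, d):
    """Rank-2 TT cores of T[i1,...,id] = i1 + ... + id (illustrative tensor, d >= 2).

    cores[k][i] is the matrix slice of core k for index value i.
    """
    first = [[[float(i), 1.0]] for i in range(n)]               # slices of shape 1 x 2
    middle = [[[1.0, 0.0], [float(i), 1.0]] for i in range(n)]  # slices of shape 2 x 2
    last = [[[1.0], [float(i)]] for i in range(n)]              # slices of shape 2 x 1
    return [first] + [middle] * (d - 2) + [last]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def tt_entry(cores, index):
    """Evaluate one tensor entry as a product of core slices."""
    result = cores[0][index[0]]
    for core, i in zip(cores[1:], index[1:]):
        result = matmul(result, core[i])
    return result[0][0]


def tt_storage(cores):
    """Number of stored values: sum over cores of n * r_left * r_right."""
    return sum(len(core) * len(core[0]) * len(core[0][0]) for core in cores)


cores = sum_tensor_cores(n=50, d=10)
print(tt_entry(cores, (3, 1, 4, 1, 5, 9, 2, 6, 5, 3)))  # sum of the indices
print(tt_storage(cores), "values in TT format vs.", 50**10, "dense entries")
```

The sketch only illustrates why low TT-ranks matter: storage grows linearly in the number of dimensions and quadratically in the ranks, whereas the dense tensor grows exponentially in the number of dimensions. Tensors arising in actual solver runs generally have much larger ranks than this toy example.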

Paper Structure (32 sections, 64 equations, 4 figures, 5 algorithms)

Introduction
Background and notation
Numerical background
Matrix decompositions
Tensor-train decomposition
Tensor unfolding and orthogonalities
Tensor-train vectors and operators
Performance characteristics on today's multicore CPU systems
Roofline performance model
Memory and cache performance
Numerical algorithms
Krylov methods: TT-GMRES
Arithmetic operations in tensor-train format
Improved Gram-Schmidt orthogonalization
Tensor-train ranks for problems with a displacement structure
...and 17 more sections

Figures (4)

Figure 1: Tensor-train ranks of the Krylov basis vectors and of the approximate solution for a $20^{10}$ convection-diffusion problem ($c=10$) and RHS $B_\text{TT}$ of ones. For TT-GMRES (left), both MGS variants lead to inaccurate solutions that do not reach the desired residual tolerance, in contrast to all cases with SIMGS. Overall, the more accurate orthogonalization (SIMGS) without restart and without preconditioning yields the lowest maximal ranks during the calculation. For MALS (right), the solution ranks increase only slowly with each sweep (as intended), but the Krylov basis vectors of the inner iteration again yield higher ranks.
Figure 2: Number of floating-point operations measured using likwid [Treibig2010] for a convection-diffusion problem ($c=10$). Dashed lines use the TT-rank-1 preconditioner. Dotted lines first transform the problem to the QTT format [Khoromskij2011]. In all cases, AMEn requires orders of magnitude fewer operations than MALS and TT-GMRES.
Figure 3: Effect of building block optimizations: For adding two tensors in the tensor-train format (left), we obtain a speedup of ${\sim}3.5$ by mapping the calculation onto faster linear algebra operations as explained in \ref{sec:building_blocks_svd_and_qr} and \ref{sec:tt_axpby_exploiting_orthogonalities}. For applying the linear operator of the inner problem in AMEn (right), we obtain a speedup of ${\sim}3$ through directly calling optimized BLAS routines and through reordering array dimensions.
Figure 4: Timings for TT-AMEn for solving a linear system from a $50^{10}$ convection-diffusion problem ($c=10$) and random RHS $B_\text{TT}$ with varying ranks. Dashed lines use the TT-rank-1 preconditioner. Dotted black lines illustrate the asymptotic complexity using the formula $c(0.35(r/700)^3+0.65(r/700)^2)$. The heuristic ALS variant (right) is about twice as fast as the full variant (left). For both variants, the time-to-solution is reduced by a factor of ${\sim}5$ by combining all suggested optimizations.
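
The asymptotic model quoted in the caption of Figure 4 can be evaluated directly to see how the cubic and quadratic terms trade off as the rank changes. The sketch below is a plain transcription of the formula $c(0.35(r/700)^3+0.65(r/700)^2)$; the function names and the normalization $c=1$ are assumptions for illustration, since the caption does not state the value of $c$.

```python
def model_time(rank, c=1.0, ref_rank=700.0):
    """Evaluate c * (0.35 * (r/700)^3 + 0.65 * (r/700)^2) from the Figure 4 caption.

    c is an unspecified scaling constant in the caption; c=1.0 is an assumption.
    """
    x = rank / ref_rank
    return c * (0.35 * x**3 + 0.65 * x**2)


def cubic_share(rank, ref_rank=700.0):
    """Fraction of the modeled time contributed by the cubic term."""
    x = rank / ref_rank
    cubic = 0.35 * x**3
    return cubic / (cubic + 0.65 * x**2)


for r in (175, 350, 700, 1400):
    print(r, model_time(r), cubic_share(r))
```

With this normalization the model equals $c$ at $r=700$, where the cubic term accounts for 35% of the modeled time; for smaller ranks the quadratic term dominates, and for larger ranks the cubic term takes over.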

Theorems & Definitions (2)

Remark 1
Remark 2
