High-Performance Tensor Contraction without Transposition

Devin A. Matthews

TL;DR

The paper addresses the high‑cost, layout‑sensitive problem of tensor contraction by proposing native contraction algorithms that operate directly on general tensors without explicit transpositions. It introduces scatter‑matrix and block‑scatter‑matrix layouts that map tensor data into matrix form, enabling BLIS‑style packing, tiling, and micro‑kernels to be reused for tensors. The SMTC and especially the BSMTC algorithms achieve performance close to matrix multiplication and outperform traditional TTGT in many cases, with strong multithreading scalability and no external workspace. These advances, implemented in the TBLIS framework, offer a practical, high‑performance alternative for diverse tensor contractions in scientific computing.
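The idea of contracting through a scatter-matrix view can be illustrated with a small NumPy sketch. This is not the TBLIS implementation: the function names (`scatter_vector`, `smtc_contract`), the block sizes, and the chosen contraction C[a,b,c] = Σ_{d,e} A[a,d,c,e]·B[e,b,d] are assumptions made for illustration, and a plain `@` stands in for the micro-kernel. The point is only that blocks are gathered ("packed") from the tensors in place and results are scattered into C, so no transposed copy of a whole tensor is ever formed.

```python
import numpy as np

def scatter_vector(lengths, strides):
    """Element offsets for a bundle of indices, first index varying fastest."""
    scat = np.zeros(1, dtype=np.int64)
    for n, s in zip(lengths, strides):
        scat = (scat[None, :] + (np.arange(n) * s)[:, None]).reshape(-1)
    return scat

def elem_strides(T):
    """Strides of T in units of elements rather than bytes."""
    return tuple(s // T.itemsize for s in T.strides)

def smtc_contract(A, B, mc=8, nc=8, kc=8):
    """C[a,b,c] = sum_{d,e} A[a,d,c,e] * B[e,b,d] using scatter vectors.

    Hypothetical sketch: blocks are gathered during packing and the
    result is scattered into C; no full-tensor transposition is made.
    """
    A, B = np.ascontiguousarray(A), np.ascontiguousarray(B)
    la, ld, lc, le = A.shape
    lb = B.shape[1]
    C = np.zeros((la, lb, lc))
    sA, sB, sC = elem_strides(A), elem_strides(B), elem_strides(C)
    a_flat, b_flat, c_flat = A.reshape(-1), B.reshape(-1), C.reshape(-1)  # views

    # Bundles: I = (a, c) rows, J = (b) columns, P = (d, e) contracted.
    a_row = scatter_vector((la, lc), (sA[0], sA[2]))
    a_col = scatter_vector((ld, le), (sA[1], sA[3]))
    b_row = scatter_vector((ld, le), (sB[2], sB[0]))
    b_col = scatter_vector((lb,), (sB[1],))
    c_row = scatter_vector((la, lc), (sC[0], sC[2]))
    c_col = scatter_vector((lb,), (sC[1],))

    m, n, k = len(c_row), len(c_col), len(a_col)
    for j0 in range(0, n, nc):
        for p0 in range(0, k, kc):
            # "Pack" a block of B by gathering through its scatter vectors.
            Bp = b_flat[b_row[p0:p0 + kc, None] + b_col[None, j0:j0 + nc]]
            for i0 in range(0, m, mc):
                # "Pack" a block of A the same way.
                Ap = a_flat[a_row[i0:i0 + mc, None] + a_col[None, p0:p0 + kc]]
                # Stand-in micro-kernel, then scatter-update the C block.
                idx = c_row[i0:i0 + mc, None] + c_col[None, j0:j0 + nc]
                c_flat[idx] += Ap @ Bp
    return C
```

The result can be compared against `np.einsum("adce,ebd->abc", A, B)` for any compatible shapes. The only temporaries are the packed blocks, whose size is bounded by the block parameters rather than by the tensor sizes.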

Abstract

Tensor computations--in particular tensor contraction (TC)--are important kernels in many scientific computing applications. Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLIS framework, which allows for transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in scientific applications.
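For contrast, the traditional BLAS-based route (TTGT: transpose, transpose, GEMM, transpose) can be sketched as follows. This is a minimal NumPy illustration under assumed names and an assumed example contraction, C[a,b,c] = Σ_{d,e} A[a,d,c,e]·B[e,b,d]; it is not code from the paper. Each `np.ascontiguousarray` call materializes a full transposed copy, which is the storage and data-movement overhead the abstract refers to.

```python
import numpy as np

def ttgt_contract(A, B):
    """C[a,b,c] = sum_{d,e} A[a,d,c,e] * B[e,b,d] via explicit transposes and one GEMM."""
    la, ld, lc, le = A.shape
    lb = B.shape[1]
    # Transpose 1: reorder A to (a, c, d, e), copy, and view as an (a*c) x (d*e) matrix.
    A_mat = np.ascontiguousarray(A.transpose(0, 2, 1, 3)).reshape(la * lc, ld * le)
    # Transpose 2: reorder B to (d, e, b), copy, and view as a (d*e) x b matrix.
    B_mat = np.ascontiguousarray(B.transpose(2, 0, 1)).reshape(ld * le, lb)
    # GEMM: a single matrix multiplication over the fused contracted index (d, e).
    C_mat = A_mat @ B_mat
    # Transpose 3: the product is laid out as (a, c, b); reorder to (a, b, c) and copy.
    return np.ascontiguousarray(C_mat.reshape(la, lc, lb).transpose(0, 2, 1))

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5, 2))
B = rng.standard_normal((2, 6, 4))
C = ttgt_contract(A, B)
```

Here three full-size temporaries are created in addition to the output, whereas the approach described in the abstract folds the reordering into the packing step that a BLIS-style matrix multiplication performs anyway.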

Paper Structure

This paper contains 19 sections, 17 equations, 8 figures.

Figures (8)

  • Figure 1: The structure of a matrix multiplication operation using the BLIS approach. Figure from https://github.com/flame/blis/wiki/Multithreading, used with permission.
  • Figure 2: Schematic implementation of the TTGT approach for tensor contraction. Notational details explained in text.
  • Figure 3: Example of a block-scatter-matrix layout (see the subsection on block-scatter-matrix layouts) for the tensor $\mathscr{C}_{abcde}\in\mathbb{R}^{6\times3\times2\times3\times4}$ with a general column-major data layout (giving strides of 1, 6, 18, 36, and 108). The matrix representation is $C_{\bar{I}\bar{J}}$ for the bundles $I=cdb$ and $J=ae$. The blocking parameters are $m_{R}=n_{R}=4$. Note that in this case the dimensions $c$ and $d$ are sequentially contiguous, so a regular stride can be maintained for larger blocks.
  • Figure 4: Handling of tensor layout types in important BLIS kernels. Each layout is assumed to refer to a tensor $\mathscr{T}_{\pi_{T}(UV)}$ (one of $\mathscr{A}$, $\mathscr{B}$, or $\mathscr{C}$) for some index bundles $U$ and $V$ and its matrix representation $T_{\bar{U}\bar{V}}$, while matrix layouts refer to a general matrix $M_{uv}$. Note that packing of block-scatter-matrix layouts may also take advantage of cases where only one of $rbs(\mathscr{T})$ and $cbs(\mathscr{T})$ indicates a constant stride.
  • Figure 5: Variadic template implementation of BSMTC. The steps specified in the GEMM<...> template can be directly compared to those in Figure 1, with the addition of tensor "matrification" (conversion from TensorMatrix<T> to BlockScatterMatrix<T>).
  • ...and 3 more figures
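
The layout described in the Figure 3 caption can be reproduced numerically. The sketch below is an illustration only: the helper names (`scatter_vector`, `block_scatter`) are hypothetical, and two conventions are assumed rather than taken from the source, namely that the first index of a bundle varies fastest and that an irregular block is marked with a stride of 0. It builds the row and column scatter vectors for the bundles I = cdb and J = ae, derives per-block strides with a block size of 4, and checks the resulting matrix view against a direct NumPy reshape.

```python
import numpy as np

def scatter_vector(lengths, strides):
    """Element offsets for a bundle of indices, first index varying fastest (assumed)."""
    scat = np.zeros(1, dtype=np.int64)
    for n, s in zip(lengths, strides):
        scat = (scat[None, :] + (np.arange(n) * s)[:, None]).reshape(-1)
    return scat

def block_scatter(scat, block):
    """Per-block stride: the constant stride if the block is regular, else 0 (assumed marker)."""
    out = []
    for start in range(0, len(scat), block):
        diffs = np.diff(scat[start:start + block])
        regular = len(diffs) > 0 and bool(np.all(diffs == diffs[0]))
        out.append(int(diffs[0]) if regular else 0)
    return out

# Tensor C_abcde of shape 6 x 3 x 2 x 3 x 4 with column-major strides.
lengths = dict(a=6, b=3, c=2, d=3, e=4)
strides = dict(a=1, b=6, c=18, d=36, e=108)

rscat = scatter_vector([lengths[i] for i in "cdb"], [strides[i] for i in "cdb"])
cscat = scatter_vector([lengths[i] for i in "ae"], [strides[i] for i in "ae"])
rbs = block_scatter(rscat, 4)   # [18, 0, 18, 18, 18]
cbs = block_scatter(cscat, 4)   # [1, 0, 1, 1, 0, 1]

# Matrix view: entry (i, j) lives at offset rscat[i] + cscat[j] in the flat data.
data = np.arange(6 * 3 * 2 * 3 * 4, dtype=np.float64)
C_mat = data[rscat[:, None] + cscat[None, :]]

# Reference: column-major tensor, reordered to (c, d, b, a, e) and flattened to 18 x 24.
T = data.reshape((6, 3, 2, 3, 4), order="F")
ref = T.transpose(2, 3, 1, 0, 4).reshape((18, 24), order="F")
```

The first six row offsets are 0, 18, 36, 54, 72, 90: because c (stride 18, length 2) and d (stride 36) are sequentially contiguous, the stride of 18 persists across the c/d boundary, which is the regularity the caption points out. Blocks that straddle a jump to the next b or e value are the ones marked irregular.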