Performance Benchmarking of Tensor Trains for accelerated Quantum-Inspired Homogenization on TPU, GPU and CPU architectures

Sascha H. Hauck, Matthias Kabel, Nicolas R. Gauger

TL;DR

The paper tackles the prohibitive memory and compute demands of FFT-based homogenization for ultra-high-resolution microstructures by adopting a tensor-train (TT)–based, quantum-inspired SFFT approach. It presents a cross-platform benchmarking study of fundamental TT operations on CPUs, GPUs, and TPUs using JAX, and adapts the SFFT homogenization pipeline for accelerator architectures. The authors demonstrate up to 10× speed-ups over CPU implementations and show that GPUs and TPUs offer comparable performance in realistic scenarios, albeit with distinct stability and memory characteristics. They introduce a coarse-graining strategy to mitigate JIT warm-up costs and outline practical guidance for deploying TT-based homogenization at industrial scales, highlighting the complementary strengths of GPUs and TPUs for different TT workloads.

Abstract

Recent advances in high-resolution CT-imaging technology are creating a new class of ultra-high-resolution microstructural datasets that challenge the limits of traditional homogenization approaches. While state-of-the-art FFT-based homogenization techniques remain effective for moderate datasets, their memory footprint and computational cost grow rapidly with increasing resolution, making them increasingly inefficient for industrial-scale problems. To address these challenges, the recently developed Superfast-Fourier Transform (SFFT)-based homogenization algorithm leverages the memory-efficient low-rank representations of Tensor Trains (TTs), which reduce the storage and computational requirements of large-scale homogenization problems. Developed for CPU usage, SFFT-based homogenization efficiently handles high-resolution datasets, provided the underlying data is well-behaved. In this work, we investigate the performance of fundamental TT operations on modern hardware accelerators using the JAX framework. This benchmarking study compares CPUs, GPUs, and TPUs in terms of execution time and computational efficiency. Building on these insights, we adapt the SFFT-based homogenization algorithm for use on accelerators, achieving speed-ups of up to 10× relative to the CPU implementation and thus paving the way for the treatment of previously infeasible dataset sizes. Our results show that GPUs and TPUs achieve comparable performance in realistic scenarios, despite the relative immaturity of the TPU ecosystem, demonstrating the potential of both architectures to accelerate quantum-inspired techniques for industrial-scale simulations, particularly for homogenization problems.
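To make the memory argument concrete, the following is a minimal NumPy sketch of how a low-rank TT can stand in for a dense array. It is not the authors' implementation: the helper names `tt_svd` and `tt_to_dense`, the test signal, and the rank cap are assumptions chosen for illustration. The sketch samples a sine wave on 1024 points, reshapes it into a ten-index tensor with two entries per index, and factorizes it with a standard sequential truncated SVD.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Factorize a dense tensor into TT cores of shape (r_left, n, r_right).

    Hypothetical helper: sequential truncated SVD with a fixed rank cap.
    """
    dims = tensor.shape
    cores = []
    r_left = 1
    mat = tensor.reshape(r_left * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))  # keep at most max_rank singular values
        cores.append(u[:, :r].reshape(r_left, dims[k], r))
        # Carry the remaining factor forward and expose the next index.
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_left = r
    cores.append(mat.reshape(r_left, dims[-1], 1))
    return cores


def tt_to_dense(cores):
    """Contract all TT cores back into the full tensor (for checking only)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))


# Illustrative signal: one period of a sine wave on 2**10 grid points,
# reshaped so that each tensor index corresponds to one bit of the grid index.
n_bits = 10
x = np.arange(2**n_bits) / 2**n_bits
dense = np.sin(2 * np.pi * x).reshape((2,) * n_bits)

cores = tt_svd(dense, max_rank=2)
tt_entries = sum(core.size for core in cores)
max_error = np.max(np.abs(tt_to_dense(cores) - dense))

print("dense entries:", dense.size)
print("TT entries:   ", tt_entries)
print("max abs error:", max_error)
```

The sine wave is a deliberately favorable case, since the angle-addition formula makes it exactly representable with TT rank 2. Real microstructure data will generally need larger ranks, which is the "well-behaved data" caveat in the abstract.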

Paper Structure

This paper contains 21 sections, 43 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Quantum circuit for the four-qubit Quantum Fourier Transform, $\mathcal{QFT}_4 = \mathcal{R}_4 \mathcal{Q}_4$.
  • Figure 2: Parallelizable operations for TTs and TTOs. Columns correspond to the operation type: addition of two TTs (a,d), multiplication between two TTs (b,e), and contraction between a TT and a TTO (c,f). The first row shows runtime benchmarks, while the second row shows the Roofline model for GPU and TPU.
  • Figure 3: Serial operations for TTs. Columns correspond to the operation type: orthogonalization (a,d), SVD-based compression (b,e), and polar-based compression of a TT (c,f). The first row shows runtime benchmarks, while the second row shows the Roofline model for GPU and TPU.
  • Figure 4: (a) 2D box geometry used in the experiments. Material parameters: $E_1 = 29/3\,\mathrm{GPa}$ and $E_2 = 4/3\,\mathrm{GPa}$ with Poisson ratios $\nu_1 = \nu_2 = 1/3$ for the grey and white regions, respectively. (b) Local stress field $\sigma_{yy}$ under tensile load obtained after running the quantum-inspired homogenization algorithm on a TPU.
  • Figure 5: Benchmarking results for the SFFT-based homogenization algorithm: (a) average time per iteration on the CPU (NumPy), and (b) achieved speed-up on GPU and TPU (JAX) for increasing discretizations.
  • ...and 4 more figures
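The TT operations named in the captions of Figures 2 and 3 can be illustrated with a short NumPy sketch. It is not the paper's code: the helpers `random_tt`, `tt_add`, `tt_hadamard`, and `tt_to_dense` are hypothetical, and reading "multiplication between two TTs" as the element-wise (Hadamard) product is an assumption. The sketch shows that addition and multiplication act on each core independently, which is why they parallelize well, and that they inflate the TT ranks, which is why a compression step is needed afterwards.

```python
import numpy as np


def random_tt(dims, rank, rng):
    """Random TT with cores of shape (r_left, n, r_right); boundary ranks are 1."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return [rng.standard_normal((ranks[k], n, ranks[k + 1]))
            for k, n in enumerate(dims)]


def tt_to_dense(cores):
    """Contract all cores into the full tensor (only feasible for small tests)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))


def tt_add(a, b):
    """Sum of two TTs: block-diagonal cores, so the ranks add."""
    if len(a) == 1:
        return [a[0] + b[0]]
    last = len(a) - 1
    cores = []
    for k, (ca, cb) in enumerate(zip(a, b)):
        if k == 0:
            cores.append(np.concatenate([ca, cb], axis=2))  # row block
        elif k == last:
            cores.append(np.concatenate([ca, cb], axis=0))  # column block
        else:
            ra, n, sa = ca.shape
            rb, _, sb = cb.shape
            core = np.zeros((ra + rb, n, sa + sb))
            core[:ra, :, :sa] = ca
            core[ra:, :, sa:] = cb
            cores.append(core)
    return cores


def tt_hadamard(a, b):
    """Element-wise product of two TTs: Kronecker-type cores, so the ranks multiply."""
    cores = []
    for ca, cb in zip(a, b):
        ra, n, sa = ca.shape
        rb, _, sb = cb.shape
        core = np.einsum("anb,cnd->acnbd", ca, cb)
        cores.append(core.reshape(ra * rb, n, sa * sb))
    return cores


rng = np.random.default_rng(0)
dims = (2,) * 6
x = random_tt(dims, rank=3, rng=rng)
y = random_tt(dims, rank=2, rng=rng)

tt_sum = tt_add(x, y)
tt_prod = tt_hadamard(x, y)

sum_ranks = [core.shape[2] for core in tt_sum[:-1]]
prod_ranks = [core.shape[2] for core in tt_prod[:-1]]
print("ranks after addition:      ", sum_ranks)
print("ranks after multiplication:", prod_ranks)
```

Every loop iteration touches a single pair of cores, so all cores could be processed concurrently. Orthogonalization and SVD-based compression, by contrast, sweep through the train core by core, which matches the parallel versus serial split used in the two figure captions.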