Performance Benchmarking of Tensor Trains for accelerated Quantum-Inspired Homogenization on TPU, GPU and CPU architectures
Sascha H. Hauck, Matthias Kabel, Nicolas R. Gauger
TL;DR
The paper tackles the prohibitive memory and compute demands of FFT-based homogenization for ultra-high-resolution microstructures by adopting a quantum-inspired, tensor-train (TT)-based Superfast Fourier Transform (SFFT) approach. It presents a cross-platform benchmarking study of fundamental TT operations on CPUs, GPUs, and TPUs using JAX, and adapts the SFFT homogenization pipeline to accelerator architectures. The authors report speed-ups of up to 10× over the CPU implementation and show that GPUs and TPUs offer comparable performance in realistic scenarios, albeit with distinct stability and memory characteristics. They introduce a coarse-graining strategy to mitigate JIT warm-up costs and outline practical guidance for deploying TT-based homogenization at industrial scale, highlighting the complementary strengths of GPUs and TPUs for different TT workloads.
Abstract
Recent advances in high-resolution CT imaging are producing a new class of ultra-high-resolution microstructural datasets that challenge the limits of traditional homogenization approaches. While state-of-the-art FFT-based homogenization techniques remain effective for moderately sized datasets, their memory footprint and computational cost grow rapidly with increasing resolution, making them increasingly inefficient for industrial-scale problems. To address these challenges, the recently developed Superfast Fourier Transform (SFFT)-based homogenization algorithm leverages the memory-efficient low-rank representations of Tensor Trains (TTs), which reduce the storage and computational requirements of large-scale homogenization problems. Originally developed for CPUs, SFFT-based homogenization handles high-resolution datasets efficiently, provided the underlying data is well-behaved. In this work, we investigate the performance of fundamental TT operations on modern hardware accelerators using the JAX framework. This benchmarking study compares CPUs, GPUs, and TPUs in terms of execution time and computational efficiency. Building on these insights, we adapt the SFFT-based homogenization algorithm for use on accelerators, achieving speed-ups of up to 10× relative to the CPU implementation and thus paving the way for the treatment of previously infeasible dataset sizes. Our results show that GPUs and TPUs achieve comparable performance in realistic scenarios, despite the relative immaturity of the TPU ecosystem. This demonstrates the potential of both architectures to accelerate quantum-inspired techniques for industrial-scale simulations, particularly for homogenization problems.
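To make the low-rank idea behind Tensor Trains concrete, the following is a minimal JAX sketch of a textbook TT-SVD compression followed by an inner product evaluated directly on the cores. It is an illustration under stated assumptions, not the SFFT homogenization algorithm or the benchmark code from the paper: the helper names (`tt_svd`, `tt_to_dense`, `tt_inner`), the test tensor, and the rank cap are all hypothetical choices made for this example.

```python
import jax
import jax.numpy as jnp


def tt_svd(tensor, max_rank):
    """Split a dense tensor into TT cores using sequential truncated SVDs.

    Each core has shape (r_left, n, r_right). Hypothetical helper for
    illustration; not taken from the paper.
    """
    dims = tensor.shape
    cores = []
    rank = 1
    mat = tensor
    for n in dims[:-1]:
        # Unfold: rows carry (left rank, current mode), columns carry the rest.
        mat = mat.reshape(rank * n, -1)
        u, s, vt = jnp.linalg.svd(mat, full_matrices=False)
        new_rank = min(max_rank, s.shape[0])
        cores.append(u[:, :new_rank].reshape(rank, n, new_rank))
        # Push the singular values into the remainder and continue.
        mat = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores


def tt_to_dense(cores):
    """Contract all cores back into a dense tensor (only viable for small cases)."""
    out = cores[0]
    for core in cores[1:]:
        # Contract the trailing rank index with the next core's leading rank index.
        out = jnp.tensordot(out, core, axes=1)
    return out.reshape([c.shape[1] for c in cores])


def tt_inner(a, b):
    """Inner product of two TT tensors without forming the dense tensors."""
    env = jnp.ones((1, 1), dtype=a[0].dtype)
    for ca, cb in zip(a, b):
        env = jnp.einsum("ab,anc,bnd->cd", env, ca, cb)
    return env[0, 0]


# Assumed test case: sin(x1 + x2 + x3 + x4) on a 4x4x4x4 grid, which has
# exact TT rank 2 by the angle-addition formula.
g = jnp.arange(4) * 0.3
x = jnp.sin(
    g[:, None, None, None]
    + g[None, :, None, None]
    + g[None, None, :, None]
    + g[None, None, None, :]
)

cores = tt_svd(x, max_rank=2)
n_dense = x.size
n_tt = sum(c.size for c in cores)
rel_err = jnp.linalg.norm(tt_to_dense(cores) - x) / jnp.linalg.norm(x)

# jax.jit compiles on the first call, so that call includes the warm-up cost.
ip = jax.jit(tt_inner)(cores, cores)

print("dense entries:", n_dense, "TT entries:", n_tt)
print("relative reconstruction error:", float(rel_err))
print("TT inner product vs dense:", float(ip), float(jnp.sum(x**2)))
```

In this toy case the four cores hold 48 numbers in place of the 256 entries of the dense tensor. The gap widens quickly as the number of modes grows, provided the TT ranks stay small. The same script runs unchanged on a CPU, GPU, or TPU backend, since JAX dispatches to whichever device is available.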
