mLR: Scalable Laminography Reconstruction based on Memoization

Bin Ma, Viktor Nikitin, Xi Wang, Tekin Bicer, Dong Li

TL;DR

This work addresses the prohibitive compute time and memory demands of ADMM-FFT in laminography by introducing mLR, a memoization-driven acceleration framework. Its key ideas are replacing expensive FFT calls with cached results, using a CNN-based encoder to map FFT inputs to compact keys, and serving lookups through a distributed memoization layer backed by Faiss and Redis. A complementary component, ADMM-Offload, stores large variables on NVMe SSDs. On a $2K\times2K\times2K$ input problem, mLR improves performance by $52.8\%$ on average (up to $65.4\%$) over the original ADMM-FFT, enabling laminography reconstructions that were previously limited by memory. The approach balances memory savings, computation, and convergence quality, offering a practical path to large-volume 3D imaging on memory-constrained HPC systems. The code is open-sourced for reproducibility.

Abstract

ADMM-FFT is an iterative method with high reconstruction accuracy for laminography, but it suffers from excessive computation time and large memory consumption. We introduce mLR, which employs memoization to replace the time-consuming Fast Fourier Transform (FFT) operations, based on a unique observation that similar FFT operations appear across iterations of ADMM-FFT. We introduce a series of techniques to make the application of memoization to ADMM-FFT performance-beneficial and scalable. We also introduce variable offloading to save CPU memory and to scale ADMM-FFT across GPUs within and across nodes. Using mLR, we are able to scale ADMM-FFT to an input problem of $2K\times2K\times2K$, the largest input problem that laminography reconstruction with ADMM-FFT has handled on limited memory. mLR brings a 52.8% performance improvement on average (up to 65.4%) compared to the original ADMM-FFT.
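The memoization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several pieces are simplifying assumptions: the hypothetical `ApproxFFTCache` class, a block-average encoder standing in for the CNN-based encoder, a brute-force nearest-neighbor search standing in for Faiss, in-process Python lists standing in for Redis, and an arbitrary similarity threshold.

```python
import numpy as np


class ApproxFFTCache:
    """Toy similarity-based memoization of 2D FFTs (illustrative only)."""

    def __init__(self, threshold=1e-3, key_size=8):
        self.threshold = threshold  # assumed relative-distance cutoff for a hit
        self.key_size = key_size    # key is key_size x key_size, flattened
        self.keys = []              # stand-in for a vector index such as Faiss
        self.values = []            # stand-in for a value store such as Redis
        self.hits = 0
        self.misses = 0

    def _encode(self, chunk):
        # Stand-in for a learned encoder: block-average the chunk to a compact key.
        # Assumes both chunk dimensions are divisible by key_size.
        k = self.key_size
        h, w = chunk.shape
        blocks = chunk.reshape(k, h // k, k, w // k)
        return blocks.mean(axis=(1, 3)).ravel()

    def fft2(self, chunk):
        key = self._encode(chunk)
        if self.keys:
            # Brute-force nearest-neighbor search over stored keys.
            dists = np.linalg.norm(np.stack(self.keys) - key, axis=1)
            best = int(np.argmin(dists))
            scale = np.linalg.norm(key) + 1e-12
            if dists[best] / scale < self.threshold:
                # A similar input was seen before: reuse its FFT result.
                self.hits += 1
                return self.values[best]
        # No sufficiently similar input: compute the FFT and store it.
        self.misses += 1
        result = np.fft.fft2(chunk)
        self.keys.append(key)
        self.values.append(result)
        return result


rng = np.random.default_rng(0)
cache = ApproxFFTCache()
chunk = rng.standard_normal((64, 64))

first = cache.fft2(chunk)                                           # miss: computed
second = cache.fft2(chunk + 1e-6 * rng.standard_normal((64, 64)))   # near-duplicate
other = cache.fft2(rng.standard_normal((64, 64)))                   # unrelated chunk
print(cache.hits, cache.misses)
```

A cache hit returns an approximate result, so the threshold trades FFT work against reconstruction accuracy. This is the balance between computation and convergence quality that the summary above refers to.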

Paper Structure

This paper contains 26 sections, 8 equations, 17 figures, 1 table, 2 algorithms.

Figures (17)

  • Figure 1: Computation and communication pipeline for the operation $F_{u1D}$ in the existing work.
  • Figure 2: CPU memory consumption in one ADMM iteration.
  • Figure 3: mLR's execution pipeline for the operation $F_{u2D}$ with memoization.
  • Figure 4: At a chunk location, similar chunks can appear across the iterations of ADMM-FFT.
  • Figure 5: LSP with and without operator fusion.
  • ...and 12 more figures