Neural Network Surrogate Model for Junction Temperature and Hotspot Position in 3D Multi-Layer High Bandwidth Memory (HBM) Chiplets under Varying Thermal Conditions
Chengxin Zhang, Yujie Liu, Quan Chen
TL;DR
The paper tackles thermal management challenges in multilayer TSV-based 3D HBM chiplets by developing a neural-network surrogate trained on FEM-generated data to predict junction temperature and hotspot position across a wide, high-dimensional parameter space. It introduces two architectures (1NN2out and 2NN2out), uses partial Latin hypercube sampling to manage the combinatorial space, and validates an efficient TSV-equivalence model to accelerate simulations. The results demonstrate high accuracy (maximum temperature error around $2.0\,^{\circ}\mathrm{C}$ and hotspot position error around $1\,\mu\mathrm{m}$) and fast inference (roughly $9 \times 10^{-4}$ s per case), with strong generalization to unseen thermal conditions and HTC values, achieving up to $98.41\%$ alignment with FEM benchmarks. This surrogate enables rapid thermal design optimization in HPC and AI accelerators, reducing reliance on expensive FE analyses during the early design stages.
Abstract
As the demand for computational power increases, high-bandwidth memory (HBM) has become a critical technology for next-generation computing systems. However, the widespread adoption of HBM presents significant thermal management challenges, particularly in multilayer through-silicon-via (TSV) stacked structures under varying thermal conditions, where accurate prediction of junction temperature and hotspot position is essential during early design. This work develops a data-driven neural network model for the fast prediction of junction temperature and hotspot position in 3D HBM chiplets. The model is trained on a data set of $13{,}494$ different combinations of thermal condition parameters, sampled from a vast, high-dimensional parameter space of up to $3^{27}$ combinations, and can accurately and quickly infer the junction temperature and hotspot position for any thermal conditions in that space. Moreover, it generalizes well to thermal conditions outside the parameter space. The data set is constructed using accurate finite element solvers. This method reduces the reliance on costly experimental tests and on the extensive computational resources needed for finite element analysis, and it accelerates the design and optimization of complex HBM systems, making it a valuable tool for improving thermal management and performance in high-performance computing applications.
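To give a sense of the scale quoted in the abstract, the short Python sketch below computes the size of a $3^{27}$ space and draws a level-balanced sample of 13,494 points from it. It rests on assumptions that the abstract does not confirm: that the space consists of 27 parameters with 3 candidate values each, and that a simple balanced design (each level used equally often per parameter) is a reasonable stand-in. It is not the authors' partial Latin hypercube scheme, and the names `space_size` and `balanced_sample` are hypothetical.

```python
import random

# Assumption: 27 thermal-condition parameters with 3 candidate levels each,
# which is one way to arrive at the 3**27 combinations quoted in the abstract.
N_PARAMS = 27
N_LEVELS = 3


def space_size(n_params=N_PARAMS, n_levels=N_LEVELS):
    """Number of distinct combinations in a full factorial design."""
    return n_levels ** n_params


def balanced_sample(n_samples, n_params=N_PARAMS, n_levels=N_LEVELS, seed=0):
    """Draw n_samples rows of level indices (0..n_levels-1).

    Each column is an independently shuffled list in which every level
    appears exactly n_samples / n_levels times, so every parameter is
    covered evenly. This is a generic stratified design, not the paper's
    partial Latin hypercube procedure.
    """
    if n_samples % n_levels != 0:
        raise ValueError("n_samples must be a multiple of n_levels")
    rng = random.Random(seed)
    repeats = n_samples // n_levels
    columns = []
    for _ in range(n_params):
        column = [level for level in range(n_levels) for _ in range(repeats)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose the columns into one tuple of level indices per sample.
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


total = space_size()
samples = balanced_sample(13_494)
fraction = len(samples) / total

print(f"full space: {total:,} combinations")
print(f"sampled:    {len(samples):,} combinations")
print(f"fraction:   {fraction:.2e}")
```

Under these assumptions the training set covers a fraction of the full space on the order of $10^{-9}$, which is why a structured sampling strategy is preferable to exhaustive enumeration when each point requires a finite element solve.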
