MFIT: Multi-Fidelity Thermal Modeling for 2.5D and 3D Multi-Chiplet Architectures
Lukas Pfromm, Alish Kanani, Harsh Sharma, Parth Solanki, Eric Tervo, Jaehyun Park, Janardhan Rao Doppa, Partha Pratim Pande, Umit Y. Ogras
TL;DR
MFIT introduces a unified, multi-fidelity framework for thermal modeling of 2.5D and 3D chiplet architectures, spanning fine-grained FEM, abstract FEM, thermal RC, and discrete state-space (DSS) models. Starting from a high-fidelity FEM reference, MFIT systematically derives faster abstractions with controlled accuracy losses (<0.5 °C for abstract FEM and <1.7 °C for thermal RC) to support design space exploration, and DSS models that run in milliseconds to support runtime management. Evaluation on 2.5D systems with 16, 36, and 64 chiplets and on a 16×3 3D stack shows that execution time drops from days to seconds or milliseconds while accuracy stays close to FEM. The framework is released as open source. The approach enables thermally-aware architectural optimization and runtime cooling decisions for heterogeneous 2.5D/3D packages, including MI300A-scale systems.
Abstract
Rapidly evolving artificial intelligence and machine learning applications require ever-increasing computational capabilities, while monolithic 2D design technologies approach their limits. Heterogeneous integration of smaller chiplets using a 2.5D silicon interposer and 3D packaging has emerged as a promising paradigm to overcome these limits and meet performance demands. These approaches offer significantly lower cost and higher manufacturing yield than monolithic 2D integrated circuits. However, the compact arrangement and high compute density exacerbate thermal management challenges, potentially compromising performance. Addressing these thermal modeling challenges is critical, especially as system sizes grow and different design stages require varying levels of accuracy and speed. Since no single thermal modeling technique meets all these needs, this paper introduces MFIT, a range of multi-fidelity thermal models that effectively balance accuracy and speed. These multi-fidelity models enable efficient design space exploration and runtime thermal management. Our extensive testing on systems with 16, 36, and 64 2.5D integrated chiplets and 16×3 3D integrated chiplets demonstrates that these models can reduce execution times from days to seconds or milliseconds with negligible loss in accuracy.
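To make the lower-fidelity end of this model range concrete, the following minimal sketch shows how a thermal RC network can be turned into a discrete state-space model of the form θ[k+1] = A_d θ[k] + B_d P[k]. It is an illustration under stated assumptions, not the MFIT implementation: the two-node network, the parameter values, the function names (`build_dss`, `simulate`), and the choice of backward-Euler discretization are all hypothetical.

```python
# Illustrative two-node thermal RC network, discretized with backward Euler.
# All names and values are assumptions made for this sketch only.
# State: theta[i] = temperature rise of node i above ambient, in kelvin.
# Continuous model: C * dtheta/dt = -G * theta + P

def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def build_dss(cap, g_amb, g12, dt):
    """Return (A_d, B_d) for theta[k+1] = A_d theta[k] + B_d P.

    cap   : heat capacities of the two nodes [J/K]
    g_amb : conductance from each node to ambient [W/K]
    g12   : conductance between the two nodes [W/K]
    dt    : time step [s]
    """
    # Conductance matrix of the RC network.
    G = [[g_amb[0] + g12, -g12],
         [-g12, g_amb[1] + g12]]
    # Backward Euler: (C/dt + G) theta[k+1] = (C/dt) theta[k] + P
    M = [[cap[0] / dt + G[0][0], G[0][1]],
         [G[1][0], cap[1] / dt + G[1][1]]]
    B_d = inv2(M)
    # A_d = M^-1 * diag(C/dt): scale column j of M^-1 by cap[j]/dt.
    A_d = [[B_d[i][j] * cap[j] / dt for j in range(2)] for i in range(2)]
    return A_d, B_d

def simulate(A_d, B_d, power, steps):
    """Step the discrete model from theta = 0 under constant power [W]."""
    theta = [0.0, 0.0]
    drive = matvec(B_d, power)  # constant input term, computed once
    for _ in range(steps):
        state = matvec(A_d, theta)
        theta = [state[0] + drive[0], state[1] + drive[1]]
    return theta

if __name__ == "__main__":
    A_d, B_d = build_dss(cap=[2.0, 2.0], g_amb=[0.5, 0.5], g12=0.25, dt=0.01)
    # 10 W in node 0, node 1 idle; 60 s of simulated time.
    print(simulate(A_d, B_d, power=[10.0, 0.0], steps=6000))
```

Each time step costs two small matrix-vector products, which is why a state-space form suits runtime use. With the values above, the simulated temperatures approach the analytic steady state G⁻¹P, that is, a 15 K rise in the powered node and 5 K in its neighbor.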
