Reliable edge machine learning hardware for scientific applications
Tommaso Baldi, Javier Campos, Ben Hawks, Jennifer Ngadiuba, Nhan Tran, Daniel Diaz, Javier Duarte, Ryan Kastner, Andres Meza, Melissa Quinnan, Olivia Weng, Caleb Geniesse, Amir Gholami, Michael W. Mahoney, Vladimir Loncar, Philip Harris, Joshua Agar, Shuyu Qin
TL;DR
The paper addresses the challenge of deploying reliable edge ML for high-rate scientific experiments by introducing a bit-accurate validation pipeline, quantization-robustness metrics, and fault-tolerance strategies tailored to extreme-edge hardware. It centers on the ECON-T CMS trigger case, using hls4ml to produce a functionally bit-accurate C model for large-scale simulation, and analyzes quantized loss landscapes with metrics such as CKA similarity and Hessian eigenvalues to understand training stability under 2- to 8-bit weights. Preliminary results suggest that quantization can regularize training and that a large model can be made fault tolerant by protecting only a small subset of sensitive bits, which guides the trade-off between protection and resources. The work lays out an end-to-end workflow, from bit-accurate functional simulation to loss-landscape diagnostics and targeted fault-mitigation techniques, that can inform robust, autonomous scientific experimentation for accelerated discovery.
Abstract
Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. We discuss approaches to developing and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for developing robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of developing autonomous scientific experimentation methods for accelerated scientific discovery.
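As a rough illustration of what bit-level model inspection for fault tolerance can look like, the sketch below quantizes a few weights to signed fixed point, flips each bit position in turn, and records the largest resulting change in a dot-product output. It is a toy under stated assumptions, not the paper's method: the function names, the 6-bit format with 4 fractional bits, and the single dot product standing in for a model are all hypothetical choices made for this example.

```python
def quantize(w, bits, frac_bits):
    """Round w to a signed fixed-point integer code, saturating at the range limits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    code = round(w * (1 << frac_bits))
    return max(lo, min(hi, code))


def dequantize(code, frac_bits):
    """Map an integer code back to its real value."""
    return code / (1 << frac_bits)


def flip_bit(code, bit, bits):
    """Flip one bit of a two's-complement code and return the signed result."""
    mask = (1 << bits) - 1
    flipped = (code & mask) ^ (1 << bit)
    if flipped >= 1 << (bits - 1):  # reinterpret the bit pattern as signed
        flipped -= 1 << bits
    return flipped


def bit_sensitivity(weights, inputs, bits=6, frac_bits=4):
    """Worst-case change of a dot product when a single bit at each position is flipped.

    Hypothetical stand-in for a model: one dot product of quantized weights and inputs.
    """
    codes = [quantize(w, bits, frac_bits) for w in weights]
    ref = sum(dequantize(c, frac_bits) * x for c, x in zip(codes, inputs))
    worst = [0.0] * bits
    for i, c in enumerate(codes):
        for b in range(bits):
            faulty = list(codes)
            faulty[i] = flip_bit(c, b, bits)
            out = sum(dequantize(q, frac_bits) * x for q, x in zip(faulty, inputs))
            worst[b] = max(worst[b], abs(out - ref))
    return worst


if __name__ == "__main__":
    # Index 0 is the least significant bit; the last index is the sign bit.
    print(bit_sensitivity([0.5, -0.25, 0.75], [1.0, -2.0, 0.5]))
```

In this toy setting the worst-case error doubles with each more significant bit, so the top few bits account for most of the damage. That is the intuition behind protecting only a small subset of sensitive bits rather than every bit, although a real study would measure the effect on task-level metrics instead of on a single dot product.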
