Reliable edge machine learning hardware for scientific applications

Tommaso Baldi; Javier Campos; Ben Hawks; Jennifer Ngadiuba; Nhan Tran; Daniel Diaz; Javier Duarte; Ryan Kastner; Andres Meza; Melissa Quinnan; Olivia Weng; Caleb Geniesse; Amir Gholami; Michael W. Mahoney; Vladimir Loncar; Philip Harris; Joshua Agar; Shuyu Qin

TL;DR

The paper addresses the challenge of deploying reliable edge ML for high-rate scientific experiments by introducing a bit-accurate validation pipeline, quantization-robustness metrics, and fault-tolerance strategies tailored to extreme-edge hardware. It centers on the ECON-T CMS trigger case, using hls4ml to produce a functionally bit-accurate C model for large-scale simulation, and analyzes quantized loss landscapes via metrics such as centered kernel alignment (CKA) similarity and Hessian eigenvalues to understand training stability under 2- to 8-bit weights. It demonstrates that quantization can regularize training and that a large model can achieve fault tolerance by protecting only a small subset of sensitive bits, which guides the trade-off between protection and resources. The work lays out an end-to-end workflow, from bit-accurate functional simulation to loss-landscape diagnostics and targeted fault-mitigation techniques, that can inform robust, autonomous scientific experimentation for accelerated discovery.
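To make the CKA similarity metric concrete, here is a minimal sketch of linear centered kernel alignment between two sets of layer activations. The summary does not say which CKA variant the paper uses, so the linear form, the function name `linear_cka`, and the random toy activations are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Toy activations (assumed data): acts_b is acts_a under an orthogonal
# transform, acts_c is an unrelated random representation.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
acts_b = acts_a @ rotation
acts_c = rng.normal(size=(256, 16))

cka_same = linear_cka(acts_a, acts_b)
cka_diff = linear_cka(acts_a, acts_c)
```

Because linear CKA is invariant to orthogonal transforms of the features, `cka_same` is 1 up to rounding, while `cka_diff` is much lower. In a quantization study one would compare activations of the same layer across models trained at different bit widths.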

Abstract

Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. We discuss approaches to developing and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for developing robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of developing autonomous scientific experimentation methods for accelerated scientific discovery.
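As an illustration of the fine-grained inspection mentioned above, the following sketch flips individual bits of 8-bit fixed-point weights in a toy linear layer and records how far the output moves. It is not the authors' procedure: the helper `flip_bit`, the fixed-point `scale`, the layer size, and the random inputs are assumptions chosen to keep the example small.

```python
import numpy as np


def flip_bit(weights_q, index, bit):
    """Return a copy of 1-D int8 weights with one bit of one weight flipped.

    Hypothetical helper for illustration.
    """
    flipped = weights_q.copy()
    raw = flipped.view(np.uint8)  # same memory, seen as raw two's-complement bits
    raw[index] ^= np.uint8(1 << bit)
    return flipped


rng = np.random.default_rng(0)
weights_q = rng.integers(-128, 128, size=8, dtype=np.int8)  # assumed toy weights
inputs = rng.normal(size=(64, 8))                           # assumed toy inputs
scale = 2.0 ** -7                                           # assumed fixed-point scale


def output(wq):
    # Dequantize and apply a single linear layer without bias.
    return inputs @ (wq.astype(np.float64) * scale)


baseline = output(weights_q)
sensitivity = np.zeros((8, 8))  # rows: weight index, columns: bit position
for i in range(8):
    for b in range(8):
        delta = output(flip_bit(weights_q, i, b)) - baseline
        sensitivity[i, b] = np.mean(np.abs(delta))
```

In this linear toy case the sensitivity of a weight grows by a factor of two per bit position, so the sign and high-order bits dominate. A ranking like this is one way to decide which bits would receive protection when resources are limited.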

Paper Structure (7 sections, 4 figures)

Motivation
Exemplar application and previous work
Methodology and metrics
Accurate and fast functional simulation
Quantized NN loss landscapes
Fault tolerance to bit flips and sensor noise
Outlook

Figures (4)

Figure 1: Many scientific and edge ML benchmark tasks [deiana2022applications] must process incoming data at a high rate, leading to extreme low-latency and high-bandwidth requirements. Applications illustrated here range across particle physics (LHC, DUNE), nuclear physics (EIC), material science (X-ray diffraction, microscopy), neuroscience, fusion energy, quantum information science, superconducting magnet research, and particle accelerators. These can be compared against traditional internet-of-things and mobile device applications, whose requirements are less stringent.
Figure 2: ECON-T model loss landscapes illustrating varying behaviors under different uniform quantizations between 2 and 8 bits. Behavior ranges from very jagged landscapes at 2-bit weights, to relatively smooth landscapes at 4- and 6-bit weights, to a sharp, narrow minimum at 8-bit weights.
Figure 3: Results for versions of the ECON-T model trained with different hyperparameters. The left heat map shows the EMD achieved by the model on noisy data. The right heat map shows the top Hessian eigenvalue on a logarithmic scale.
Figure 4: Performance of ECON-T models on data with 5% noise, trained with different values of $\lambda_{JR}$, the hyperparameter that sets the weight of the Jacobian regularization term.
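The top Hessian eigenvalue reported in Figure 3 is a common sharpness measure for a loss landscape. A minimal sketch of one way to estimate it, assuming only access to a gradient function, is power iteration with finite-difference Hessian-vector products. The function name `top_hessian_eigenvalue`, the step size `eps`, and the quadratic toy loss are assumptions for illustration and do not describe the paper's implementation.

```python
import numpy as np


def top_hessian_eigenvalue(grad_fn, w, n_iter=100, eps=1e-3, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration.

    Hypothetical helper; grad_fn maps a parameter vector to the loss gradient.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(n_iter):
        # Central finite difference of the gradient approximates H @ v.
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the unit vector v
        v = hv / np.linalg.norm(hv)
    return eigenvalue


# Toy loss L(w) = 0.5 * w^T H w with known eigenvalues 5, 2, 1, 0.5 (assumed).
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
hessian = q @ np.diag([5.0, 2.0, 1.0, 0.5]) @ q.T


def grad_fn(w):
    return hessian @ w


w0 = rng.normal(size=4)
top = top_hessian_eigenvalue(grad_fn, w0)
```

On this toy problem `top` recovers the known largest eigenvalue. For a real network the gradient would come from automatic differentiation, and a larger top eigenvalue would indicate a sharper minimum.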
