Switching Frequency as FPGA Monitor: Studying Degradation and Ageing Prognosis at Large Scale
Leandro Lanzieri, Lukasz Butkowski, Jiri Kral, Goerschwin Fey, Holger Schlarb, Thomas C. Schmidt
TL;DR
The paper tackles hardware ageing in unhardened embedded devices through a large-scale, in-field study of 298 FPGA boards over 280 days, focusing on the degradation of propagation delay measured with ring-oscillator sensors. It combines shutdown and continuous-monitoring analyses to quantify degradation, identifies spatial hotspots that correlate with usage patterns, and demonstrates time-series forecasting for predictive maintenance with mean errors as low as 0.002% over a 60-day horizon. A robust anomaly-detection framework and comprehensive backtesting show the practical viability of deploying degradation forecasts at scale. The work advances the understanding of real-world ageing in distributed facilities and provides a foundation for automated, proactive maintenance and self-healing strategies.
Abstract
The growing deployment of unhardened embedded devices in critical systems demands the monitoring of hardware ageing as part of predictive maintenance. In this paper, we study degradation on a large deployment of 298 naturally aged FPGAs operating in the European XFEL particle accelerator. We base our statistical analyses on 280 days of in-field measurements and find a generalized and continuous degradation of the switching frequency across all devices, with a median value of 0.064%. The large scale of this study allows us to localize areas of the deployed FPGAs that are highly impacted by degradation. Moreover, by training machine learning models on the collected data, we are able to forecast future trends of frequency degradation with horizons of 60 days and relative errors as low as 0.002% over an evaluation period of 100 days.
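To make the reported quantity concrete, the following is a minimal sketch of how a relative switching-frequency degradation and its median over a set of devices could be computed. It assumes degradation is expressed as the percentage drop relative to the first measurement; the abstract does not state the exact definition, and the function name, board names, and frequency values below are hypothetical, not data from the study.

```python
from statistics import median


def relative_degradation_percent(f_start: float, f_end: float) -> float:
    """Relative drop in switching frequency, in percent of the initial value.

    Assumption: degradation is measured against the first reading.
    """
    if f_start <= 0:
        raise ValueError("initial frequency must be positive")
    return (f_start - f_end) / f_start * 100.0


# Hypothetical ring-oscillator readings in MHz: (first, last) per device.
# These numbers are made up for illustration only.
readings = {
    "board_a": (200.0, 199.9),
    "board_b": (200.0, 199.8),
    "board_c": (100.0, 99.95),
}

# Per-device degradation in percent, then the median over all devices.
per_device = {
    name: relative_degradation_percent(f0, f1)
    for name, (f0, f1) in readings.items()
}
fleet_median = median(per_device.values())

for name, value in per_device.items():
    print(f"{name}: {value:.3f}%")
print(f"median: {fleet_median:.3f}%")
```

Using the median rather than the mean keeps the summary robust to a few strongly degraded or faulty devices, which is one plausible reason to report it for a large deployment.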
