Concurrent Classifier Error Detection (CCED) in Large Scale Machine Learning Systems

Pedro Reviriego; Ziheng Wang; Alvaro Alonso; Zhen Gao; Farzad Niknia; Shanshan Liu; Fabrizio Lombardi

TL;DR

The paper addresses the reliability of large-scale ML systems under transient soft errors by proposing Concurrent Classifier Error Detection (CCED), in which a lightweight concurrent classifier monitors a small set of internal check signals to detect errors that change the output. The concurrent classifier is trained on error-free and error-induced patterns and runs in parallel with the main model, so CCED achieves high detection rates with minimal overhead and can trigger a re-run of the inference to correct soft errors. Evaluations on CLIP and BERT across multiple datasets show that most errors are detected with a modest recomputation budget, especially when the main model has high accuracy. CCED thus offers a scalable, model-agnostic approach to error detection that integrates with ML design flows and is relevant to safety-critical deployments.

Abstract

The complexity of Machine Learning (ML) systems increases each year, with current implementations of large language models or text-to-image generators having billions of parameters and requiring billions of arithmetic operations. As these systems are widely utilized, ensuring their reliable operation is becoming a design requirement. Traditional error detection mechanisms introduce circuit or time redundancy that significantly impacts system performance. An alternative is the use of Concurrent Error Detection (CED) schemes, which operate in parallel with the system and exploit its properties to detect errors. CED is attractive for large ML systems because it can potentially reduce the cost of error detection. In this paper, we introduce Concurrent Classifier Error Detection (CCED), a scheme to implement CED in ML systems using a concurrent ML classifier to detect errors. CCED identifies a set of check signals in the main ML system and feeds them to the concurrent ML classifier, which is trained to detect errors. The proposed CCED scheme has been implemented and evaluated on two widely used large-scale ML models: Contrastive Language-Image Pretraining (CLIP), used for image classification, and Bidirectional Encoder Representations from Transformers (BERT), used for natural language applications. The results show that more than 95 percent of the errors are detected when using a simple Random Forest classifier that is an order of magnitude simpler than CLIP or BERT. These results illustrate the potential of CCED to implement error detection in large-scale ML models.
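
To make the flow described in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The main model is a toy linear classifier, the check signals are assumed to be its sorted output logits, the fault model (adding a fixed offset to one non-winning logit) is hypothetical, and a depth-1 decision tree (a stump) stands in for the Random Forest used in the paper so that the example needs only the Python standard library. All function names are illustrative.

```python
import random

NUM_CLASSES, NUM_FEATURES = 3, 4  # toy sizes, chosen only for illustration

def make_model(rng):
    """Toy stand-in for the main model: one linear layer with weights in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(NUM_FEATURES)]
            for _ in range(NUM_CLASSES)]

def logits(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def inject_soft_error(scores, rng):
    """Hypothetical fault model: add a large offset to one non-winning logit."""
    faulty = list(scores)
    losers = [i for i in range(len(scores)) if i != argmax(scores)]
    faulty[rng.choice(losers)] += 16.0
    return faulty

def check_signals(scores):
    """Assumed check signals: the logits sorted in descending order."""
    return sorted(scores, reverse=True)

def build_dataset(weights, n, rng):
    """Label 0 = error-free run, label 1 = run whose output an error changed."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(NUM_FEATURES)]
        clean = logits(weights, x)
        data.append((check_signals(clean), 0))
        faulty = inject_soft_error(clean, rng)
        if argmax(faulty) != argmax(clean):  # keep only output-changing errors
            data.append((check_signals(faulty), 1))
    return data

def train_stump(data):
    """Fit a depth-1 tree: flag an error when signals[feature] > threshold."""
    best = (len(data) + 1, 0, 0.0)  # (training mistakes, feature, threshold)
    for f in range(NUM_CLASSES):
        values = sorted({s[f] for s, _ in data})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            mistakes = sum((s[f] > thr) != bool(y) for s, y in data)
            if mistakes < best[0]:
                best = (mistakes, f, thr)
    return best[1], best[2]

def detect(stump, signals):
    feature, thr = stump
    return signals[feature] > thr

def protected_inference(weights, x, stump, rng, fault=False):
    """Run the main model; if the detector fires, recompute the inference."""
    scores = logits(weights, x)
    if fault:
        scores = inject_soft_error(scores, rng)
    if detect(stump, check_signals(scores)):
        scores = logits(weights, x)  # re-run, assumed fault-free (transient error)
    return argmax(scores)

rng = random.Random(0)
weights = make_model(rng)
stump = train_stump(build_dataset(weights, 200, rng))
test_set = build_dataset(weights, 200, rng)
total_errors = sum(y for _, y in test_set)
detected = sum(detect(stump, s) for s, y in test_set if y == 1)
false_alarms = sum(detect(stump, s) for s, y in test_set if y == 0)
print(f"detected {detected}/{total_errors} errors, {false_alarms} false alarms")
```

In this toy setting the injected offset is larger than any error-free logit, so the two classes are separable by construction and the stump detects every injected error without false alarms. A real system would not be this clean: the paper reports detection of more than 95 percent of errors, and each false alarm costs one unnecessary recomputation.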
