Analysis of Failures and Risks in Deep Learning Model Converters: A Case Study in the ONNX Ecosystem

Purvish Jajal, Wenxin Jiang, Arav Tewari, Erik Kocinare, Joseph Woo, Anusha Sarraf, Yung-Hsiang Lu, George K. Thiruvathukal, James C. Davis

TL;DR

This work provides the first systematic analysis of failures in DL model converters, focusing on ONNX as the leading interoperability target. It combines a practitioner survey (N=92), a large-scale failure analysis of 200 ONNX converter issues (PyTorch and TensorFlow origins), and hypothesis testing on root causes. Key findings show that the Node Conversion stage accounts for roughly 75% of defects and that about 33% of failures involve semantically incorrect models, with failures linked to compatibility and type problems and correlated with operator sequences rather than single operators. The study highlights the need for improved validation, coverage metrics, and tolerance-aware testing to enhance the reliability of DL interoperability tooling and suggests directions for future research in architectural coverage and behavioral tolerances.

Abstract

Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ~75% of the defects and that 33% of reported failures are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics could be fruitful.

Paper Structure (43 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Paths from model development to deployment on hardware. Model interoperability facilitates reuse across frameworks and deployment environments [davis2023reusing]. A represents model conversion to a common intermediary. B represents compilation. C represents model deployment. D represents model conversion to a framework.
  • Figure 2: PyTorch model converted to ONNX Intermediate Representation. The PyTorch model calculates the per-row maximum using torch.max. In ONNX, this uses the operators ArgMax plus ReduceMax.
  • Figure 3: Goal, research questions, methods, and data sources.
  • Figure 4: Filtering of issues for each repository studied. Filters are using GitHub search predicates. Commit/PR filters are applied to issue timeline events. Data were collected on Jan. 6, 2023. 100 issues per repository were analyzed.
  • Figure 5: Additions and updates of operators by ONNX operator set version, from version 1 (2017) to version 18 (2022). Version size := sum of operator additions and updates.
  • ...and 3 more figures
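
The Figure 2 caption describes one framework-level call (the per-row maximum via torch.max) mapping to two ONNX operators, ArgMax plus ReduceMax. The following is a minimal pure-Python sketch of that decomposition, not the converter's actual implementation. The function names are hypothetical stand-ins for the operators, and the tie-breaking rule (first occurrence wins) is an assumption of this sketch.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def reduce_max_rows(matrix: Matrix) -> List[float]:
    """Hypothetical stand-in for ReduceMax along axis 1: one maximum per row."""
    return [max(row) for row in matrix]


def arg_max_rows(matrix: Matrix) -> List[int]:
    """Hypothetical stand-in for ArgMax along axis 1.

    Assumption: on ties, the index of the first maximal element is returned.
    """
    return [row.index(max(row)) for row in matrix]


def row_max(matrix: Matrix) -> Tuple[List[float], List[int]]:
    """Combine both stand-ins into a (values, indices) pair, as a per-row max would."""
    return reduce_max_rows(matrix), arg_max_rows(matrix)


x = [
    [1.0, 5.0, 2.0],
    [7.0, 3.0, 7.0],   # tie: the first occurrence (index 0) is chosen here
    [-1.0, -4.0, -2.0],
]
values, indices = row_max(x)
print(values)   # per-row maxima
print(indices)  # per-row positions of those maxima
```

Because a single source operation becomes a sequence of target operators, a mismatch in either half (for example, a different tie-breaking rule) could change model behaviour without raising a conversion error.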