Mind the Gap: Revealing Inconsistencies Across Heterogeneous AI Accelerators

Elliott Wen, Sean Ma, Ewan Tempero, Jens Dietrich, Daniel Luo, Jiaxing Shen, Kaiqi Zhao, Bruce Sham, Yousong Song, Jiayi Hua, Jia Hong

TL;DR

This paper addresses the problem of ensuring consistent machine learning behavior across heterogeneous AI accelerators. It introduces a differential testing pipeline that synthesizes over 100,000 variant models from a corpus of 4,000 real-world models, runs them on five enterprise-grade accelerators, and analyzes discrepancies via a data-flow tracing approach. Key findings show that newer platforms such as Mac and Huawei support at least 17% fewer operators than NVIDIA and exhibit output discrepancy rates exceeding 5%. Compilation-based acceleration (e.g., TorchDynamo) is also less reliable on these platforms. The study uncovers 7 PyTorch bugs and 40 platform-specific issues. The work highlights reproducibility challenges in a diverse hardware landscape and provides actionable insights for reducing cross-platform divergence, including improving operator coverage and compiler robustness.

Abstract

While NVIDIA remains the dominant provider of AI accelerators within cloud data centers, emerging vendors such as AMD, Intel, Mac, and Huawei offer cost-effective alternatives with claims of compatibility and performance. This paper presents the first empirical study investigating divergence in machine learning model behavior across heterogeneous AI accelerators. Using an automated pipeline, we synthesize over 100,000 variant models derived from 4,000 real-world models and execute them across five different enterprise-grade accelerators. Our findings suggest that newer AI platforms from Mac and Huawei support at least 17% fewer operators than NVIDIA. These platforms also exhibit a higher rate of output discrepancies (exceeding 5%), which stem from differences in operator implementations, handling of exceptional numerical values, and instruction scheduling. They are also more susceptible to failures during compilation-based model acceleration, and in some cases the compiled models produce outputs that differ noticeably from those generated in the standard execution mode. In addition, we identify 7 implementation flaws in PyTorch and 40 platform-specific issues across vendors. These results underscore the challenges of achieving consistent machine learning behavior in an increasingly diverse hardware ecosystem.
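
To make the idea of an output discrepancy rate concrete, the following is a minimal sketch of a differential comparison between two platforms. It is not the authors' pipeline: the function names, the tolerance values, and the sample data are all hypothetical, and model outputs are assumed to have been flattened into plain lists of floats. It treats exceptional values explicitly, since the abstract names their handling as one source of divergence.

```python
import math

def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return True if two flat output vectors agree within tolerance.

    Assumed policy (not from the paper): NaN only matches NaN, and
    infinities must match exactly, including sign.
    """
    if len(reference) != len(candidate):
        return False
    for r, c in zip(reference, candidate):
        if math.isnan(r) or math.isnan(c):
            if not (math.isnan(r) and math.isnan(c)):
                return False
        elif math.isinf(r) or math.isinf(c):
            if r != c:
                return False
        elif not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True

def discrepancy_rate(reference_outputs, candidate_outputs):
    """Fraction of models, among those that ran on both platforms,
    whose candidate output differs from the reference output."""
    shared = reference_outputs.keys() & candidate_outputs.keys()
    if not shared:
        return 0.0
    mismatched = sum(
        1 for name in shared
        if not outputs_match(reference_outputs[name], candidate_outputs[name])
    )
    return mismatched / len(shared)

# Hypothetical outputs keyed by model name; values are made up for illustration.
reference_run = {
    "m1": [1.0, 2.0],
    "m2": [float("inf"), 0.5],
    "m3": [float("nan")],
    "m4": [0.1],
}
candidate_run = {
    "m1": [1.0, 2.0000001],       # within tolerance
    "m2": [float("nan"), 0.5],    # inf on one platform, NaN on the other
    "m3": [float("nan")],         # NaN on both platforms
    "m4": [0.2],                  # plain numerical mismatch
}

print(discrepancy_rate(reference_run, candidate_run))
```

In this sketch, models that failed to execute on either platform are simply excluded from the denominator. Whether the paper counts such failures separately or folds them into the discrepancy rate is not stated in the abstract.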

Paper Structure

This paper contains 14 sections, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: Our pipeline to uncover the behavior inconsistency across different hardware platforms
  • Figure 2: PyTorch Model Representation: From High-Level Python Code to Computation Graph
  • Figure 3: Model Execution Failures in Default Execution Mode
  • Figure 4: Model Output Consistency Across Platforms
  • Figure 5: Result Consistency Between Compilation Mode and Eager Mode