GCN-ABFT: Low-Cost Online Error Checking for Graph Convolutional Networks

Christodoulos Peltekis, Giorgos Dimitrakopoulos

TL;DR

GCN hardware accelerators are exposed to random hardware faults that can corrupt their results. The authors introduce GCN-ABFT, a fused ABFT approach that computes a single checksum for the complete three-matrix product $H_{out} = S\,H\,W$ of each layer, instead of checking after each of the two matrix multiplications. Across four node-classification benchmarks, GCN-ABFT reduces checksum-related operations by over 21% on average while maintaining fault-detection accuracy; fault-injection experiments show detection rates above 93% and fewer false positives. The method trades a small per-layer delay, and a possible missed detection in the pathological case where $S$ contains an all-zero column, for notable energy savings and broad applicability to GNNs and other three-matrix operations.

Abstract

Graph convolutional networks (GCNs) are popular for building machine-learning applications for graph-structured data. This widespread adoption has led to the development of specialized GCN hardware accelerators. In this work, we address a key architectural challenge for GCN accelerators: how to detect errors in GCN computations arising from random hardware faults at the lowest computation cost. Each GCN layer performs a graph convolution, mathematically equivalent to multiplying three matrices, computed through two separate matrix multiplications. Existing Algorithm-based Fault Tolerance (ABFT) techniques can check the results of individual matrix multiplications. For a GCN layer, however, this check has to be performed twice. To avoid this overhead, this work introduces GCN-ABFT, which directly calculates a checksum for the entire three-matrix product within a single GCN layer, providing a cost-effective approach for error detection in GCN accelerators. Experimental results demonstrate that GCN-ABFT reduces the number of operations needed for checksum computation by over 21% on average for representative GCN applications. These savings are achieved without sacrificing fault-detection accuracy, as evidenced by the presented fault-injection analysis.
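
The idea of a single checksum for the whole three-matrix product can be illustrated with the standard ABFT identity $e^T (S H W)\, e = (e^T S)\, H\, (W e)$, where $e$ is an all-ones vector. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `fused_checksum`, `actual_checksum`), the small integer matrices, and the multiplication order $(S H) W$ are all hypothetical choices made for this example. Integers are used so that the comparison is exact; a floating-point accelerator would presumably need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def fused_checksum(s, h, w):
    """Predict the sum of all entries of S*H*W without forming the product.

    Relies on e^T (S H W) e = (e^T S) H (W e), with e an all-ones vector.
    """
    col_sums_s = [sum(row[j] for row in s) for j in range(len(s[0]))]  # e^T S
    row_sums_w = [sum(row) for row in w]                               # W e
    h_w_e = [sum(h_row[k] * row_sums_w[k] for k in range(len(row_sums_w)))
             for h_row in h]                                           # H (W e)
    return sum(c * x for c, x in zip(col_sums_s, h_w_e))


def actual_checksum(h_out):
    """Sum of all entries of the computed output matrix."""
    return sum(sum(row) for row in h_out)


# Hypothetical toy inputs: S (3x3), H (3x2), W (2x2).
S = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
H = [[2, 0], [1, 3], [4, 1]]
W = [[1, 2], [0, 1]]

# Two separate multiplications, as in a GCN layer (order assumed here).
H_out = matmul(matmul(S, H), W)

# One check for the whole layer instead of one per multiplication.
expected = fused_checksum(S, H, W)
fault_free_ok = actual_checksum(H_out) == expected

# Emulate a fault by corrupting a single output element.
H_faulty = [row[:] for row in H_out]
H_faulty[1][0] += 8
fault_detected = actual_checksum(H_faulty) != expected
```

In this sketch the predicted checksum needs only two vector reductions and two matrix-vector style products, and no check state is attached to the intermediate result, which is the kind of saving the fused approach targets.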

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: ABFT applied separately on the two phases of graph convolution operation.
  • Figure 2: GCN-ABFT applied in the two phases of graph convolution. Fusing checksum computation removes the need to enhance matrix $H$ with additional check state.
  • Figure 3: How the execution time is split across the first and the second matrix multiplication step of each GCN layer for both layers of the examined GCN applications.