GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision Tasks on FPGA

Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna

TL;DR

The paper tackles end-to-end latency in GNN-based computer vision tasks that combine CNN and GNN computations. It introduces GCV-Turbo, a domain-specific FPGA accelerator with a unified data path and a PyTorch-compatible compiler that maps full GNN-CV models to hardware, enabling end-to-end optimization without reconfiguring the FPGA. Across six tasks, the approach reduces average latency by 68.4× relative to CPU baselines and 4.1× relative to GPU baselines, and it remains competitive with state-of-the-art CNN and GNN accelerators on CNN-only and GNN-only models. The work presents a hardware-compiler co-design approach for mixed CNN/GNN workloads aimed at latency-sensitive CV applications, with future plans to extend to Vision Transformers.

Abstract

Graph neural networks (GNNs) have recently empowered various novel computer vision (CV) tasks. In GNN-based CV tasks, either a combination of CNN layers and GNN layers or only GNN layers are employed. This paper introduces GCV-Turbo, a domain-specific accelerator on FPGA for end-to-end acceleration of GNN-based CV tasks. GCV-Turbo consists of two key components: (1) a novel hardware architecture optimized for the computation kernels in both CNNs and GNNs using the same set of computation resources, and (2) a PyTorch-compatible compiler that takes a user-defined model as input, performs end-to-end optimization for the computation graph of a given GNN-based CV task, and produces optimized code for hardware execution. The hardware architecture and the compiler work synergistically to support a variety of GNN-based CV tasks. We implement GCV-Turbo on a state-of-the-art FPGA and evaluate its performance across six representative GNN-based CV tasks with diverse input data modalities (e.g., image, human skeleton, point cloud). Compared with state-of-the-art CPU (GPU) implementations, GCV-Turbo achieves an average latency reduction of $68.4\times$ ($4.1\times$) on these six GNN-based CV tasks. Moreover, GCV-Turbo supports the execution of standalone CNNs or GNNs, achieving performance comparable to that of state-of-the-art CNN (GNN) accelerators for widely used CNN-only (GNN-only) models.

Paper Structure (31 sections, 10 figures, 12 tables, 2 algorithms)

This paper contains 31 sections, 10 figures, 12 tables, 2 algorithms.

Figures (10)

  • Figure 1: Examples of GNN-based CV tasks [garcia2018few, chen2019multi, zhang2019dual, yan2018spatial]
  • Figure 2: Breakdown analysis of GNN-based CV tasks on a state-of-the-art GPU (RTX A5000). The details of the models and datasets are elaborated in the implementation section.
  • Figure 3: Overview of GCV-Turbo
  • Figure 4: Workflow of GCV-Turbo using skeleton-based human action recognition [yan2018spatial] as an example.
  • Figure 5: Architecture of hardware accelerator, and the basic computation primitives supported by a PE.
  • ...and 5 more figures