Exploring Parallelism in FPGA-Based Accelerators for Machine Learning Applications

Sed Centeno; Christopher Sprague; Arnab A Purkayastha; Ray Simar; Neeraj Magotra

TL;DR

The paper accelerates neural network training by overlapping the forward and backward passes through speculative backpropagation. It implements a CPU-based OpenMP prototype for MNIST and reports speedups of up to $24\%$ in overall execution time and up to $35\%$ per step, with accuracy within $3$–$4\%$ of the baseline at a threshold of $0.25$. FPGA synthesis is planned but not yet reported. The method combines speculative gradient reuse with thread-level parallelism on a simple 4-layer network, which gives a clear path toward FPGA-based acceleration. The work argues that speculative training is practical for energy-efficient, low-latency AI workloads and motivates extensions to deeper models and RTL implementations on real hardware.

Abstract

Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. Applying speculative weight updates when error gradients fall within a specific threshold reduces training time without substantially compromising accuracy. In this work, we implement speculative backpropagation on the MNIST dataset using OpenMP as the parallel programming platform. OpenMP's multi-threading capabilities enable simultaneous execution of the forward and speculative backpropagation steps, improving training speed. The application is planned for synthesis on a state-of-the-art FPGA to demonstrate its potential for hardware acceleration. Our CPU-based experimental results show that speculative backpropagation achieves a maximum speedup of 24% in execution time at a threshold of 0.25, with accuracy remaining within 3-4% of the baseline across various epochs. Additionally, when individual step execution times are compared, speculative backpropagation yields a maximum speedup of 35% over the baseline, demonstrating the effectiveness of overlapping forward and backward passes.
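The threshold-gated overlap described in the abstract can be illustrated with a minimal sketch. This is not the authors' OpenMP implementation: it uses a single linear neuron instead of the 4-layer network, Python threads instead of OpenMP, and an assumed acceptance rule (the speculative gradient is kept when the new error differs from the previous step's error by at most the threshold). The names `forward`, `gradient`, and `speculative_step` are hypothetical; only the threshold value of 0.25 comes from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def forward(weights, x):
    """Forward pass of a single linear neuron (stand-in for the real network)."""
    return sum(w * xi for w, xi in zip(weights, x))


def gradient(error, x):
    """Gradient of 0.5 * error**2 with respect to each weight."""
    return [error * xi for xi in x]


def speculative_step(weights, x, target, prev_error, threshold=0.25, lr=0.1):
    """Overlap the forward pass with a backward pass based on the previous error.

    Returns (new_weights, actual_error, speculation_used).
    The acceptance rule below is an assumption made for illustration.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        fwd = pool.submit(forward, weights, x)
        # Speculative backward pass: starts before the true error is known.
        spec = pool.submit(gradient, prev_error, x)
        actual_error = fwd.result() - target
        spec_grad = spec.result()

    if abs(actual_error - prev_error) <= threshold:
        # Speculation close enough: keep the gradient computed in parallel.
        grad, used = spec_grad, True
    else:
        # Speculation rejected: fall back to a regular backward pass.
        grad, used = gradient(actual_error, x), False

    new_weights = [w - lr * g for w, g in zip(weights, grad)]
    return new_weights, actual_error, used


if __name__ == "__main__":
    w, err, used = speculative_step([0.5, -0.5], [1.0, 2.0], 0.0, prev_error=-0.4)
    print(w, err, used)
```

Python threads will not show a real speedup for this workload because of the interpreter lock; the sketch only shows the control flow. A larger threshold accepts more speculative updates, which is the trade-off between speed and accuracy that the abstract quantifies.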
