Banked Memories for Soft SIMT Processors
Martin Langhammer, George A. Constantinides
TL;DR
The paper addresses memory bottlenecks in soft SIMT FPGA processors by comparing 9 memory architectures (multi-port and banked) across 51 benchmark configurations built from matrix transpose and FFT workloads. It proposes a scalable banked memory design with read/write issue controllers, shared-memory arbitration, and carry-based arbitration, and evaluates FPGA fitting and area costs. Key findings show that multi-port memories often outperform banked designs for small memories because of their simplicity, while banked memories keep a stable footprint and high bandwidth as data size grows; the best choice thus depends on memory size and access patterns. The work provides practical guidance for designing FPGA-based soft GPGPU-like accelerators and is relevant to high-level synthesis (HLS) and other FPGA-driven accelerators that need explicit memory-architecture tradeoffs.
Abstract
Recent advances in soft GPGPU architectures have shown that a small (<10K LUT), high-performance (770 MHz) processor is possible in modern FPGAs. In this paper we architect and evaluate banked memories for soft SIMT processors, which can support high bandwidth (up to 16 ports) while maintaining high speed (over 770 MHz). We compare 9 different memory architectures, including simpler multi-port memories, and run a total of 51 benchmarks (different combinations of algorithms, data sizes, and processor memories) to develop a comprehensive data set that will guide the reader in making an informed memory architecture decision for their application. Our benchmarks consist of matrix transpositions (memory intensive) and FFTs (split between memory accesses, floating-point computations, and integer computations) to provide a balanced evaluation. We show that the simpler (but more memory-block-intensive) multi-port memories offer higher performance than the more architecturally complex banked memories for many applications, especially for smaller memories, but the effective footprint cost of the multi-port memories quickly becomes prohibitive as dataset sizes increase. Our banked memory implementation results (high bandwidth, high Fmax, and high density) can also be used for other FPGA applications, such as High-Level Synthesis (HLS).
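
To make concrete why matrix transposition is a memory-intensive test for a banked memory, the following is a minimal sketch, not the paper's implementation. It assumes 16 banks selected by the low-order address bits (`addr % num_banks`), 16 SIMT lanes, a row-major matrix, and one request served per bank per cycle; all function names and parameters are hypothetical.

```python
from collections import Counter


def cycles_for_access(addresses, num_banks=16):
    """Cycles to serve one SIMT memory operation.

    Assumption: each bank serves one request per cycle, so the
    operation takes as many cycles as the most heavily loaded bank.
    """
    per_bank = Counter(addr % num_banks for addr in addresses)
    return max(per_bank.values())


def transpose_cycles(n, lanes=16, num_banks=16):
    """Return (read_cycles, write_cycles) for an n x n row-major transpose."""
    read_cycles = 0
    write_cycles = 0
    for row in range(n):
        for col0 in range(0, n, lanes):
            cols = range(col0, min(col0 + lanes, n))
            reads = [row * n + c for c in cols]    # unit-stride reads
            writes = [c * n + row for c in cols]   # stride-n writes
            read_cycles += cycles_for_access(reads, num_banks)
            write_cycles += cycles_for_access(writes, num_banks)
    return read_cycles, write_cycles


if __name__ == "__main__":
    print(transpose_cycles(32))  # (64, 1024)
```

Under these assumptions, the unit-stride reads of a 32 x 32 transpose spread across all 16 banks and take one cycle per operation, while the stride-32 writes all land in a single bank and serialize to 16 cycles each. This is the kind of access-pattern sensitivity that separates banked from multi-port memories.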
