Memory-efficient Sketch Acceleration for Handling Large Network Flows on FPGAs
Zhaoyang Han, Yicheng Qian, Michael Zink, Miriam Leeser
TL;DR
This work tackles memory bottlenecks in Count-Min (CM) sketch implementations on FPGA-based NICs under large-volume network traffic. It introduces HBRICK, a hardware-friendly, variable-width counter design that extends the CM sketch to support larger hash tables with reduced overestimation. The design is implemented with a P4 front end and HLS compute kernels and validated on an AMD Alveo U280 at line rate on a 100 Gbps link. Key contributions include the hardware-friendly multi-level counter design, parallel indexing with a fixed-latency update path, data packing for optional levels, and an overflow store with an associative memory, all integrated into an end-to-end P4+HLS workflow on the Open Cloud Testbed. Experimental results show improved BRAM efficiency, competitive throughput (real-time ~92 Gbps; theoretical ~195 Gbps for 64-byte packets), and favorable accuracy on skewed traffic, establishing a practical path for scalable in-network analytics on FPGA NICs.
Abstract
Sketch-based algorithms for network traffic monitoring have drawn increasing interest in recent years due to their sub-linear memory usage and high accuracy. As the volume of network traffic grows, software-based sketch implementations cannot keep up with the throughput of incoming network flows. FPGA-based hardware sketches have shown better performance than software running on a CPU when handling these packets. Among the various sketch algorithms, the Count-Min sketch is one of the most popular and efficient. However, because on-chip memory is limited, FPGA-based Count-Min sketch accelerators suffer performance drops as network traffic grows. In this work, we propose a hardware-friendly architecture with a variable-width memory counter for the Count-Min sketch. Our architecture stores the sketch data structure more compactly, allowing us to support larger hash tables and reduce overestimation errors. The design makes use of a P4-based programmable data plane and the AMD OpenNIC shell. It is implemented and verified on the Open Cloud Testbed running on AMD Alveo U280s and keeps up with the 100 Gbit/s link speed.
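To make the link between hash table size and overestimation concrete, the following is a minimal software sketch of a plain Count-Min sketch. It is not the proposed variable-width design or its HLS kernel. The class name `CountMinSketch`, the helper `total_overestimate`, the salted BLAKE2 hash, and the synthetic skewed flow counts are all assumptions made for illustration only.

```python
import hashlib


class CountMinSketch:
    """Plain Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, depth: int, width: int) -> None:
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash per row, obtained by salting BLAKE2b with the row id.
        # (Assumption: hardware designs typically use cheaper hash functions.)
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions only ever add to a counter, so the minimum across rows
        # is the tightest estimate and never falls below the true count.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


def total_overestimate(flows: dict, depth: int, width: int) -> int:
    """Sum of (estimate - true count) over all flows for a given table size."""
    sketch = CountMinSketch(depth, width)
    for key, count in flows.items():
        sketch.update(key, count)
    return sum(sketch.estimate(key) - count for key, count in flows.items())


# Synthetic, skewed traffic: a few heavy flows and many light ones.
flows = {f"flow{i}": max(1, 1000 // (i + 1)) for i in range(2000)}

error_narrow = total_overestimate(flows, depth=3, width=64)
error_wide = total_overestimate(flows, depth=3, width=1024)
```

With the same number of rows, the wider table spreads flows over more counters, so fewer flows collide and the accumulated overestimate shrinks. This is why a more compact counter representation, which fits more counters into the same on-chip memory, can translate into lower estimation error.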
