Towards High-Performance Network Coding: FPGA Acceleration With Bounded-value Generators

Jiaxin Qing; Philip H. W. Leong; Kin Hong Lee; Raymond W. Yeung

TL;DR

This paper addresses the practicality of implementing high-throughput network coding with Batched Sparse (BATS) codes in hardware. It introduces Cyclic-Shift BATS (CS-BATS), a structured variant that maps efficiently to hardware, and bounded-value (BV) generators, which shrink finite-field multiplier complexity while preserving coding performance. The authors design a scalable FPGA accelerator that combines BATS compute units (CUs), matrix tiling, multi-level parallelism, and high-bandwidth memory (HBM). It achieves up to 27 Gbps throughput, a speedup of more than 300× over software, and BV generators reduce multiplier area by up to 70%. Theoretical and empirical analyses show that BV generators have negligible impact on coding performance when sized appropriately (e.g., $L(2^2)$), and implementation results show that throughput scales with the number of CUs, the port configuration, and the HBM settings. Overall, the work demonstrates a viable hardware-software co-design path for practical, high-rate network coding, with implications for wireless and distributed storage systems.

Abstract

Network coding enhances performance in network communications and distributed storage by increasing throughput and robustness while reducing latency. Batched Sparse (BATS) codes are a class of capacity-achieving network codes, but their practical application is hindered by their structure and by the computational intensity and power demands of finite field operations. Most of the literature focuses on algorithm-level techniques to improve coding efficiency, while optimization through algorithm/hardware co-design has long been neglected. Leveraging the unique structure of BATS codes, we first present CS-BATS, a hardware-friendly variant. Next, we propose a simple but effective bounded-value generator that reduces the size of a finite field multiplier by up to 70%. Finally, we report on a scalable and resource-efficient FPGA-based network coding accelerator that achieves a throughput of 27 Gbps, a speedup of more than 300× over software.
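As a rough illustration of why bounding generator values could shrink a finite field multiplier, the sketch below assumes that a bounded-value generator restricts coding coefficients to values below a small bound such as $2^2$. That reading, the field GF($2^8$), the reduction polynomial 0x11B, and the function names are all assumptions made for illustration; none of them are specified in the abstract. Under those assumptions, a shift-and-add multiplier only needs as many partial products as the coefficient has bits.

```python
# Hypothetical sketch. GF(2^8) with polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)
# is an assumption; the abstract does not name the field or the polynomial.
POLY = 0x11B


def gf_mul(a, b, b_bits=8):
    """Shift-and-add multiply in GF(2^8), scanning only the low b_bits of b."""
    acc = 0
    for _ in range(b_bits):
        if b & 1:
            acc ^= a          # add (XOR) the current partial product
        b >>= 1
        a <<= 1               # multiply a by x
        if a & 0x100:
            a ^= POLY         # reduce modulo the field polynomial
    return acc


def gf_mul_bounded(a, c, bound_bits=2):
    """Multiply by a coefficient assumed to satisfy c < 2**bound_bits.

    Only bound_bits partial products are formed, instead of 8 for a
    general coefficient.
    """
    if not 0 <= c < (1 << bound_bits):
        raise ValueError("coefficient exceeds the assumed bound")
    return gf_mul(a, c, b_bits=bound_bits)
```

In this sketch the bounded multiplier gives the same result as the full one for every coefficient below the bound, while the loop that would become hardware partial products runs 2 times instead of 8. The actual multiplier design and the reported savings of up to 70% come from the paper, not from this example.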

Paper Structure (38 sections, 16 equations, 9 figures, 3 tables, 6 algorithms)

Introduction
Background
BATS Code
Random BATS
Cyclic-Shift BATS (CS-BATS)
Construction
Complexity
BATS Compute Unit
Decouple Loading and Computing
Matrix tiling
Maximize memory bandwidth utilization
Multi-level Parallelism and HBM
Input Port Contention and Delay
Output Port Contention
Load Balance Scheduler
...and 23 more sections

Figures (9)

Figure 1: Graphical description of the BATS code. (a) Tanner graph representation. Circles are variable nodes, and squares are check nodes; (b) Adjacency matrix representation. Each darkened cell represents a connection; (c) Encoding process. Linear combinations mix information from input packets.
Figure 2: CS-BATS construction from a base graph. The base graph is a 4 by 8 adjacency matrix. Cyclic shifting is applied to each row of the base graph to construct new rows.
Figure 3: Decouple loading and computing. When loading a tile of $t_m\times t_k$ from matrix B, for each data transaction, we load $t_m+\alpha$ elements from its column. The $\alpha$ is chosen to fully utilize the memory bandwidth.
Figure 4: (a) Block diagram of the BATS accelerator system. FF PE: finite field processing element; BATS CU: BATS compute unit. The accelerator can have multiple BATS CUs. Each BATS CU accesses the HBM through an individual HBM pseudo channel (PC). With the Xilinx HLS Vitis flow, the maximum port width between the PL and the HBM is 512 bits. (b) Matrix tiling along the row, which reuses the tiles of matrix B. (c) The internal routing of the crossbar switch.
Figure 5: Scheduling comparison. Runtime comparison of 4 CUs is shown. The clock cycles shown are for discussion purposes. (a) Sequential scheduling. (b) Load balance scheduling.
...and 4 more figures
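The cyclic-shift construction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration only: the base graph entries, the shift amounts, the shift direction, and the function names are hypothetical, since the caption states only that the base graph is a 4 by 8 adjacency matrix and that new rows come from cyclically shifting its rows.

```python
def cyclic_shift(row, s):
    """Rotate a row to the right by s positions (direction is an assumption)."""
    s %= len(row)
    return list(row[-s:]) + list(row[:-s]) if s else list(row)


def expand_base_graph(base, shifts):
    """Keep the base rows, then append one shifted copy of each per shift."""
    rows = [list(r) for r in base]
    for s in shifts:
        rows.extend(cyclic_shift(r, s) for r in base)
    return rows


# Hypothetical 4-by-8 base adjacency matrix (entries chosen for illustration).
base = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]

# Two hypothetical shift amounts turn 4 base rows into 12 rows.
expanded = expand_base_graph(base, shifts=[1, 2])
```

Because each new row is a rotation of a base row, its number of connections is unchanged, and only the base graph and the shift amounts need to be stored to regenerate the full matrix.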

Theorems & Definitions (1)

Definition 1
