Chorba: A novel CRC32 implementation

Sam Russell

TL;DR

This work targets efficient CRC32 computation without lookup tables or dedicated hardware. Here $G(x)$ denotes the CRC32 generator polynomial and $M(x)$ the message polynomial. The core idea is to exploit zero polynomials $Z(x)$, which satisfy $Z(x) \bmod G(x) = 0$, to reduce $M(x)$ during its reduction by $G(x)$, turning one long reduction into a sequence of cheaper operations over $\mathrm{GF}(2)$. The paper introduces and analyzes sparse and dense zero polynomials, including the scaled generator polynomial and several 3–5 term zeros, and builds high-throughput software implementations from them using braiding-like techniques and selective folding, with attention to whether the input is processed destructively or non-destructively. Empirical results on Ryzen, Graviton2, and Raspberry Pi 4 CPUs show substantial throughput improvements, often matching or surpassing hardware-accelerated solutions for larger messages; for small messages, simpler polynomials perform better because of initialization costs. The study also covers AVX-enabled architectures, where memory bandwidth becomes the limiting factor, and identifies $x^{14870} + x^{22} + x^{11} + x^7 + 1$ scaled by 8 as a robust all-around polynomial and a practical drop-in replacement, while noting that results vary across hardware and workloads.
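The zero-polynomial idea can be illustrated with a minimal Python sketch. It is not the paper's implementation. It models only the raw remainder $M(x) \bmod G(x)$ and ignores the bit reflection, initial value, and final XOR of the standard CRC32. The helper names `poly_mod` and `scale` are hypothetical, and the sketch assumes that "scaled by $k$" means substituting $x \to x^k$. Over $\mathrm{GF}(2)$, $G(x^8) = G(x)^8$, so the scaled generator is a multiple of $G(x)$ and therefore a zero polynomial.

```python
# CRC-32 generator polynomial G(x) with the x^32 term included (normal bit order).
G = 0x104C11DB7

def poly_mod(a: int, g: int = G) -> int:
    """Remainder of a(x) divided by g(x) over GF(2).

    Integers are used as coefficient bit vectors: bit i is the coefficient of x^i.
    """
    glen = g.bit_length()
    while a.bit_length() >= glen:
        # Cancel the leading term of a(x) with a shifted copy of g(x).
        a ^= g << (a.bit_length() - glen)
    return a

def scale(p: int, k: int) -> int:
    """Substitute x -> x^k (assumed meaning of 'scaled by k'): bit i moves to bit i*k."""
    out, i = 0, 0
    while p:
        if p & 1:
            out |= 1 << (i * k)
        p >>= 1
        i += 1
    return out

# The scaled generator has the same number of terms as G(x), spread 8x further apart.
Z = scale(G, 8)
assert poly_mod(Z) == 0  # Z(x) mod G(x) = 0, so Z is a zero polynomial

# Adding (XORing) any shifted copy of Z into the message leaves M(x) mod G(x) unchanged,
# so Z can be used to cancel high-order message bits in place of G.
M = int.from_bytes(b"hello zero polynomials, hello zero polynomials", "big")
shift = M.bit_length() - Z.bit_length()
reduced = M ^ (Z << shift)

assert reduced.bit_length() < M.bit_length()  # the leading term was cancelled
assert poly_mod(reduced) == poly_mod(M)       # the remainder is preserved
```

The sketch uses the scaled generator because its zero property follows directly from arithmetic over $\mathrm{GF}(2)$. The same check, `poly_mod(Z) == 0`, could be applied to any other candidate zero polynomial.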

Abstract

This paper describes a novel method for efficiently calculating CRC checksums without lookup tables or hardware support for polynomial multiplication. Throughput of CRC32 is increased by 100% across different platforms compared with the current state of the art. Performance is on par with or exceeds hardware-accelerated solutions on x86_64 and ARMv8 processors, and these hardware-accelerated solutions see a performance increase of 5-20% depending on message length. The small number of operations required with this approach could simplify hardware CRC32 implementations.

Paper Structure

This paper contains 14 sections, 7 figures.

Figures (7)

  • Figure 1: Expanding the generator polynomial with the scaling identity
  • Figure 2: Comparing manually unrolled bitwise loops against braiding
  • Figure 3: AMD Ryzen 5 5600
  • Figure 4: AWS Graviton2
  • Figure 5: Raspberry Pi 4
  • ...and 2 more figures