Chorba: A novel CRC32 implementation
Sam Russell
TL;DR
This work targets efficient CRC32 computation without lookup tables or dedicated hardware. Here $G(x)$ denotes the CRC32 generator polynomial and $M(x)$ the message polynomial. The core idea is to exploit zero polynomials $Z(x)$, which satisfy $Z(x) \bmod G(x) = 0$: adding any multiple of $Z(x)$ to $M(x)$ leaves the remainder modulo $G(x)$ unchanged, so a long reduction can be transformed into a sequence of cheaper operations in $\mathrm{GF}(2)$. The work introduces and analyzes sparse and dense zero polynomials, including the scaled generator polynomial and several 3–5 term zeros, to achieve high-throughput software implementations via braiding-like techniques and selective folding, with careful consideration of destructive versus non-destructive processing. Empirical results on diverse CPUs (Ryzen, Graviton2, Raspberry Pi 4) show substantial throughput improvements, often matching or surpassing hardware-accelerated solutions for larger messages, though small-message performance favors simpler polynomials due to initialization costs. The study also extends to AVX-enabled architectures, where memory bandwidth becomes the limiting factor, and identifies a robust all-around polynomial, $x^{14870} + x^{22} + x^{11} + x^7 + 1$ scaled by 8, as a practical drop-in replacement, while noting variability across hardware and workloads.
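The zero-polynomial property can be checked with a minimal Python sketch. It rests on assumptions not taken from the paper: $G(x)$ is taken to be the standard CRC-32 polynomial 0x04C11DB7 with its $x^{32}$ term made explicit, polynomials are stored as integers whose bit $i$ holds the coefficient of $x^i$, and the helper names `poly_mod` and `poly_mul` are hypothetical. CRC32's bit reflection, initial value and final XOR are ignored, so this illustrates only the algebra, not the paper's implementation.

```python
# Assumption: standard CRC-32 generator 0x04C11DB7 with the x^32 term explicit.
G = 0x104C11DB7


def poly_mod(m: int, g: int = G) -> int:
    """Remainder of m(x) divided by g(x) over GF(2) (hypothetical helper)."""
    glen = g.bit_length()
    while m.bit_length() >= glen:
        # XOR a shifted copy of g to cancel the current leading term of m.
        m ^= g << (m.bit_length() - glen)
    return m


def poly_mul(a: int, b: int) -> int:
    """Carry-less (GF(2)) product of a(x) and b(x) (hypothetical helper)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


# The message polynomial M(x), taken from arbitrary example bytes.
message = int.from_bytes(b"zero polynomial demo", "big")

# Two zero polynomials: the generator scaled by x^8, and x^100 plus its own
# remainder. Both are multiples of G(x) by construction.
z_scaled = G << 8
z_built = (1 << 100) ^ poly_mod(1 << 100)

# Adding an arbitrary multiple q(x) * Z(x) leaves M(x) mod G(x) unchanged.
q = 0b1011001
same_remainder = poly_mod(message ^ poly_mul(q, z_scaled)) == poly_mod(message)

# Aligning Z(x) with the leading term of M(x) cancels that term, shortening
# the message while preserving the remainder.
shift = message.bit_length() - z_scaled.bit_length()
folded = message ^ (z_scaled << shift)

print(same_remainder, folded.bit_length() < message.bit_length())
```

In this sketch the scaled generator cancels only one leading term per step. A sparse zero polynomial with few terms needs fewer XORs per cancelled term, which is the property the paper sets out to exploit.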
Abstract
This paper describes a novel method for efficiently calculating CRC checksums without lookup tables or hardware support for polynomial multiplication. Throughput of CRC32 is increased by 100% across different platforms compared with the current state of the art. Performance is on par with or exceeds hardware-accelerated solutions on x86_64 and ARMv8 processors, and these hardware-accelerated solutions see a performance increase of 5-20% depending on message length. The small number of operations required with this approach could simplify hardware CRC32 implementations.
