Vectorised Hashing Based on Bernstein-Rabin-Winograd Polynomials over Prime Order Fields
Kaushik Nath, Palash Sarkar
TL;DR
The paper presents a BRW-based AXU hash, $c$-decBRWHash, and hand-optimised AVX2 assembly implementations of its 4-way parallel variant over the primes $2^{130}-5$ and $2^{127}-1$. It formalises padding, the BRW and polyHash functions, and the AXU bounds, and gives a detailed vectorised algorithmic framework. For $p=2^{130}-5$, the timings show 4-decBRWHash outperforming the AVX2 Poly1305/polyHash implementations by about 16% for messages of a few kilobytes and about 23% for messages of a few megabytes. The main contributions are the decimated BRW construction, its AXU analysis, and the AVX2 implementations with extensive timing results. The work indicates that BRW-based AXU hashing can offer a meaningful performance advantage for authentication and authenticated encryption in software, especially for data in the kilobyte-to-megabyte range.
Abstract
We introduce the new AXU hash function decBRWHash, which is parameterised by a positive integer $c$ and is based on Bernstein-Rabin-Winograd (BRW) polynomials. Choosing $c>1$ gives a hash function which can be implemented using $c$-way single instruction multiple data (SIMD) instructions. We report a comprehensive set of hand-optimised assembly implementations of 4-decBRWHash using the AVX2 SIMD instructions available on modern Intel processors. For comparison, we also report similarly optimised AVX2 assembly implementations of polyHash, an AXU hash function based on usual polynomials. Our implementations are over prime order fields, specifically for the primes $2^{127}-1$ and $2^{130}-5$. For the prime $2^{130}-5$ and AVX2 implementations, 4-decBRWHash is faster than the well-known Poly1305 hash function for messages that are a few hundred bytes long; the speed-up is about 16% for message lengths in the range of a few kilobytes and rises to about 23% for message lengths in the range of a few megabytes.
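To make the contrast between "usual polynomials" and BRW polynomials concrete, the following is a minimal Python sketch, not the paper's construction. It assumes the message has already been split into blocks that are elements of the field, and it omits padding, the decimation into $c$ streams, and all vectorisation. The names `poly_hash`, `brw` and `brw_mults` are hypothetical. The recursion is the standard definition of BRW polynomials from the literature, and the exact form of polyHash used in the paper may differ from the Horner form shown here.

```python
# Illustrative sketch only: no padding, no decimation, no SIMD.
P = 2**130 - 5  # one of the two primes named in the abstract


def poly_hash(key, blocks, p=P):
    """Horner evaluation of m1*K^l + ... + ml*K mod p (assumed form).

    Uses one field multiplication per block.
    """
    acc = 0
    for m in blocks:
        acc = (acc + m) * key % p
    return acc


def brw(key, blocks, p=P):
    """Evaluate the BRW polynomial of `blocks` at `key` modulo p."""
    l = len(blocks)
    if l == 0:
        return 0
    if l == 1:
        return blocks[0] % p
    if l == 2:
        return (blocks[0] * key + blocks[1]) % p
    if l == 3:
        return ((key + blocks[0]) * (pow(key, 2, p) + blocks[1]) + blocks[2]) % p
    # t is the largest power of two with 4 <= t <= l, so that t <= l < 2t.
    t = 4
    while 2 * t <= l:
        t *= 2
    left = brw(key, blocks[:t - 1], p)   # first t-1 blocks
    right = brw(key, blocks[t:], p)      # blocks after the t-th one
    return (left * (pow(key, t, p) + blocks[t - 1]) + right) % p


def brw_mults(l):
    """Count field multiplications in brw() for l blocks.

    Key powers K^2, K^4, ... are assumed precomputed and are not counted.
    """
    if l < 2:
        return 0
    if l < 4:
        return 1
    t = 4
    while 2 * t <= l:
        t *= 2
    return 1 + brw_mults(t - 1) + brw_mults(l - t)


if __name__ == "__main__":
    print(poly_hash(2, [1, 2, 3]))   # Horner form
    print(brw(2, [1, 2, 3]))         # (K + m1)(K^2 + m2) + m3
    print(brw_mults(64))             # multiplications for 64 blocks
```

Under these assumptions, `brw_mults(l)` equals `l // 2` for every `l`, against about `l` multiplications for the Horner evaluation. The key powers cost additional squarings that a real implementation would compute once. Halving the multiplication count is the usual motivation for BRW-based hashing.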
