RAS: A Bit-Exact rANS Accelerator For High-Performance Neural Lossless Compression

Yuchao Qin, Anjunyi Fan, Bonan Yan

TL;DR

Lossless data-center compression demands high throughput with bit-exact recovery. RAS is a hardware–software codesign that fuses an rANS core with a neural probability generator, employing a mixed-precision probability path, a two-stage update, and a prediction-guided decoder to prune CDF searches while preserving exactness. Key contributions include a BF16-to-fixed-point probability interface with mass correction, a two-stage update enabling sustained throughput, and a decoder that uses prediction to reduce average search depth, all scalable via a multi-lane architecture. The approach demonstrates substantial RTL speedups (encode ≈121.2×, decode ≈70.9×) while maintaining competitive compression when paired with learned priors, offering a practical path to fast neural lossless compression in data-center workloads. The work generalizes to other ANS variants and highlights a viable route to integrate on-chip probability generation for energy- and latency-aware lossless coding.

Abstract

Data centers handle vast volumes of data that require efficient lossless compression, yet emerging methods based on probabilistic models are often computationally slow. To address this, we introduce RAS, the Range Asymmetric Numeral System Acceleration System, a hardware architecture that integrates the rANS algorithm into a lossless compression pipeline and eliminates key bottlenecks. RAS couples an rANS core with a probabilistic generator, storing distributions in BF16 format and converting them once into a fixed-point domain shared by a unified division/modulo datapath. A two-stage rANS update with byte-level re-normalization reduces logic cost and memory traffic, while a prediction-guided decoding path speculatively narrows the cumulative distribution function (CDF) search window and safely falls back to a full search to maintain bit-exactness. A multi-lane organization scales throughput and enables fine-grained clock gating for efficient scheduling. On image workloads, our RTL-simulated prototype achieves 121.2× encode and 70.9× decode speedups over a Python rANS baseline, and reduces average decoder binary-search steps from 7.00 to 3.15 (approximately 55% fewer). When paired with neural probability models, RAS sustains higher compression ratios than classical codecs and outperforms CPU/GPU rANS implementations, offering a practical approach to fast neural lossless compression.
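To make the probability path and the state update in the abstract concrete, here is a minimal software sketch, not the RAS RTL. It emulates BF16 storage by truncation, quantizes probabilities to integer frequencies with a simple mass-correction loop, and runs a textbook byte-wise rANS round trip. The constants `PROB_BITS` and `RANS_L`, all function names, and the particular mass-correction rule (adjust the largest bin until the total is exact) are assumptions made for illustration; the paper's actual scheme may differ.

```python
import struct
from bisect import bisect_right

PROB_BITS = 12               # assumed fixed-point precision: frequencies sum to 2**12
TOTAL = 1 << PROB_BITS
RANS_L = 1 << 23             # assumed lower bound of the normalized state interval

def to_bf16(x):
    """Emulate BF16 storage by keeping only the top 16 bits of a float32 (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def quantize_with_mass_correction(probs):
    """Map probabilities to integer frequencies >= 1 that sum exactly to TOTAL."""
    norm = sum(probs)
    freqs = [max(1, int(p / norm * TOTAL)) for p in probs]
    diff = TOTAL - sum(freqs)
    while diff != 0:         # hypothetical rule: move the surplus/deficit onto the largest bin
        i = max(range(len(freqs)), key=freqs.__getitem__)
        step = 1 if diff > 0 else -1
        freqs[i] += step
        diff -= step
    return freqs

def build_cdf(freqs):
    """Exclusive prefix sums: cdf[s] is the start of symbol s, cdf[-1] == TOTAL."""
    cdf = [0]
    for f in freqs:
        cdf.append(cdf[-1] + f)
    return cdf

def rans_encode(symbols, freqs, cdf):
    x, out = RANS_L, bytearray()
    for s in reversed(symbols):              # rANS encodes in reverse order
        f, c = freqs[s], cdf[s]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while x >= x_max:                    # byte-level re-normalization
            out.append(x & 0xFF)
            x >>= 8
        q, r = divmod(x, f)                  # quotient and remainder of the same division
        x = (q << PROB_BITS) + r + c
    for _ in range(4):                       # flush the final state (< 2**31)
        out.append(x & 0xFF)
        x >>= 8
    out.reverse()                            # decoder reads bytes front to back
    return bytes(out)

def rans_decode(data, count, freqs, cdf):
    x, pos, out = int.from_bytes(data[:4], "big"), 4, []
    for _ in range(count):
        slot = x & (TOTAL - 1)
        s = bisect_right(cdf, slot) - 1      # symbol with cdf[s] <= slot < cdf[s+1]
        x = freqs[s] * (x >> PROB_BITS) + slot - cdf[s]
        while x < RANS_L:                    # pull bytes back in
            x = (x << 8) | data[pos]
            pos += 1
        out.append(s)
    return out

# Round trip on a toy 4-symbol alphabet
freqs = quantize_with_mass_correction([to_bf16(p) for p in (0.5, 0.25, 0.125, 0.125)])
cdf = build_cdf(freqs)
message = [0, 1, 0, 2, 3, 0, 0, 1]
assert rans_decode(rans_encode(message, freqs, cdf), len(message), freqs, cdf) == message
```

Because encoder and decoder share the same integer `freqs` and `cdf` tables, the round trip is exact regardless of how lossy the BF16 storage step is; only the compression ratio depends on how well the quantized table matches the true distribution.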

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Overview of the hardware-software codesign for learned lossless compression. Modeling: neural/PC models for images and LLMs for text provide calibrated probabilities. Algorithmic: progression from Huffman and software rANS to rANS on RAS (this work) increases throughput and compression ratio. System/Hardware: RAS integrates a control unit, networks, pipelined encode/decode, and shared global memory, replacing a low-efficiency CPU/GPU + probability generator setup while preserving bit-exactness.
  • Figure 2: Overall RAS architecture. Left: Learnable models (orange block) produce absolute distributions that are stored in Global Memory (blue block); the SPC performs a single BF16$\rightarrow$fixed-point conversion with mass correction and streams shared CDF/frequency tables. Middle: The rANS Encoder and Decoder share a mixed-precision div/mod datapath with a two-stage update (parallel quotient/remainder) and byte-level re-normalization; per-lane Middle-State and Low-bit State memories sustain throughput. Right: A prediction-guided decoding path proposes a trial symbol and verifies it against the CDF, falling back on mismatch; this reduces the average search depth while preserving bit-exactness. A simple multi-lane fabric with arbitration and clock gating scales throughput without changing the bitstream.
  • Figure 3: Prediction-guided rANS decoding: the neighborhood average (201) defines a window $[\mathrm{Avg}-8,\mathrm{Avg}+8]$; a dichotomous refinement ($\pm8 \rightarrow \pm4$) resolves the symbol, yielding the correct value 205.
  • Figure 4: Design exploration results (simulated): (a) Cycle-normalized compute cost (cycles/run; lower is better) for compression and decompression, comparing RAS with a Python rANS baseline; annotations show speedups of $121.2\times$ (encode) and $70.9\times$ (decode). (b) Decoder binary-search cost: average steps per symbol drop from $7.00$ to $3.15$ with prediction ($\approx55\%$ fewer). (c) Compression ratio on ImageNet32/64 and CIFAR-10 for classical codecs (solid) and neural rANS-based models (dashed); the neural models coded with rANS (IDF, PiMC) achieve higher ratios.
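The prediction-guided lookup described for Figures 2 and 3 can be sketched in software as follows. This is an illustrative model under stated assumptions, not the authors' decoder: the function names, the default `radius=8` (taken from the $\pm8$ window in Figure 3), and the step-counting convention are hypothetical. The key property it demonstrates is that the window is only trusted after a check against the CDF, so the returned symbol is always the one a full search would find.

```python
def cdf_search(cdf, slot, lo, hi):
    """Binary search in [lo, hi] for s with cdf[s] <= slot < cdf[s+1].

    Returns (symbol, number of comparison steps).
    """
    steps = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        steps += 1
        if cdf[mid] <= slot:
            lo = mid
        else:
            hi = mid - 1
    return lo, steps

def predicted_lookup(cdf, slot, pred, radius=8):
    """Search a window around a predicted symbol; fall back to the full range on a miss."""
    n = len(cdf) - 1                       # alphabet size
    lo = max(0, pred - radius)
    hi = min(n - 1, pred + radius)
    # Verification: the window is usable only if the slot lies inside its CDF span.
    if cdf[lo] <= slot < cdf[hi + 1]:
        return cdf_search(cdf, slot, lo, hi)
    return cdf_search(cdf, slot, 0, n - 1)  # safe fallback keeps decoding bit-exact

# Toy setup: 256 symbols with uniform frequency 16 (assumed for illustration only).
cdf = [16 * i for i in range(257)]
slot = 205 * 16 + 3                        # a slot that belongs to symbol 205

sym_full, steps_full = cdf_search(cdf, slot, 0, 255)
sym_pred, steps_pred = predicted_lookup(cdf, slot, pred=201)   # neighborhood average from Figure 3
sym_miss, _ = predicted_lookup(cdf, slot, pred=10)             # bad prediction triggers fallback

assert sym_full == sym_pred == sym_miss == 205
assert steps_pred < steps_full
```

A good prediction shrinks the search range from the whole alphabet to at most `2 * radius + 1` candidates, while a bad prediction costs only the verification check before the ordinary search runs. The step counts of this toy model are not meant to reproduce the 7.00 and 3.15 averages reported in Figure 4.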