Learned Compression of Encoding Distributions
Mateen Ulhaq, Ivan V. Bajić
TL;DR
This work tackles the amortization gap in learned image compression by introducing input-specific encoding distributions that are compressed and transmitted as side-information. The method uses kernel density estimation (KDE) to build histograms that serve as target per-channel distributions, and lightweight neural modules to reconstruct the adaptive encoding distributions at the decoder, keeping the side-information rate overhead low while improving rate-distortion performance. Results on Kodak show a BD-rate of $-7.10\%$ (a 7.10% rate reduction) for the standard fully-factorized model, with substantially smaller model size and computation than scale hyperprior approaches. The approach provides a practical pathway to enhance entropy models with low overhead, enabling more efficient learned compression without extensive architectural changes.
Abstract
The entropy bottleneck introduced by Ballé et al. is a common component used in many learned compression models. It encodes a transformed latent representation using a static distribution whose parameters are learned during training. However, the actual distribution of the latent data may vary wildly across different inputs. The static distribution attempts to encompass all possible input distributions, thus fitting none of them particularly well. This unfortunate phenomenon, sometimes known as the amortization gap, results in suboptimal compression. To address this issue, we propose a method that dynamically adapts the encoding distribution to match the latent data distribution for a specific input. First, our model estimates a better encoding distribution for a given input. This distribution is then compressed and transmitted as an additional side-information bitstream. Finally, the decoder reconstructs the encoding distribution and uses it to decompress the corresponding latent data. Our method achieves a Bjøntegaard-Delta (BD)-rate of -7.10% (i.e., a 7.10% rate reduction) on the Kodak test dataset when applied to the standard fully-factorized architecture. Furthermore, considering computational complexity, the transform used by our method is an order of magnitude cheaper in terms of Multiply-Accumulate (MAC) operations compared to related side-information methods such as the scale hyperprior.
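To make the amortization gap concrete, the following is a minimal sketch, not the authors' implementation. It compares the expected rate when symbols from one input are coded with a broad static distribution against the rate when they are coded with a distribution matched to that input. The discretized Gaussian shapes, the support range, and all names (`static_q`, `input_p`, and so on) are assumptions chosen purely for illustration. The difference between the two rates is the KL divergence between the distributions, which bounds how many bits per symbol an input-specific distribution could save before accounting for side-information.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def discretized_gaussian(support, mean, scale):
    """Illustrative pmf: Gaussian weights sampled at integer symbols, then normalized."""
    return normalize([math.exp(-0.5 * ((s - mean) / scale) ** 2) for s in support])

def entropy_bits(p):
    """Expected bits per symbol when symbols drawn from p are coded with p itself."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Expected bits per symbol when symbols drawn from p are coded with q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical quantized latent symbols for one channel.
support = list(range(-16, 17))

# Assumption: the static (trained) distribution is broad because it must
# cover many inputs, while this particular input is narrow and off-center.
static_q = discretized_gaussian(support, mean=0.0, scale=4.0)
input_p = discretized_gaussian(support, mean=1.0, scale=1.0)

rate_static = cross_entropy_bits(input_p, static_q)  # static encoding distribution
rate_adaptive = entropy_bits(input_p)                # perfectly adapted distribution
gap = rate_static - rate_adaptive                    # equals KL(input_p || static_q)

print(f"static rate:   {rate_static:.3f} bits/symbol")
print(f"adaptive rate: {rate_adaptive:.3f} bits/symbol")
print(f"gap (KL):      {gap:.3f} bits/symbol")
```

In this toy setting, adapting the distribution pays off only if the bits spent transmitting it as side-information are fewer than `gap` multiplied by the number of symbols coded with it, which is why the abstract emphasizes compressing the distribution itself.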
