Variational image compression with a scale hyperprior

Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, Nick Johnston

TL;DR

Ballé et al. present a variational image compression framework augmented with a scale hyperprior that models spatial dependencies in the latent representation. The hyperprior acts as learned side information: the distribution of the latents is conditioned on an auxiliary latent variable z, which improves entropy coding. Empirical results on Kodak and Tecnick show state-of-the-art MS-SSIM performance and PSNR that surpasses previously published ANN-based methods, highlighting the importance of flexible priors in learned compression. The work also compares models trained for different distortion metrics qualitatively and establishes a principled way to incorporate side information into end-to-end neural codecs.

Abstract

We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs, but largely unexplored in image compression using artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a complex prior jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR). Furthermore, we provide a qualitative comparison of models trained for different distortion metrics.
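
To make the role of the hyperprior concrete, the following is a minimal sketch of why a spatially adaptive scale can lower the bit cost of the latents. It assumes each quantized latent is coded under a zero-mean Gaussian with standard deviation sigma, convolved with a unit-width uniform density, so that an integer value receives the probability mass of the interval of width one around it. The latent values, the scales, and all function names are hypothetical and chosen only for illustration; the cost of transmitting the side information itself is not counted here.

```python
import math


def gaussian_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def bits_for_symbol(y_hat, sigma):
    """Ideal code length in bits for an integer latent y_hat.

    The probability mass is the Gaussian integrated over
    [y_hat - 0.5, y_hat + 0.5] (an assumption of this sketch).
    """
    p = gaussian_cdf(y_hat + 0.5, sigma) - gaussian_cdf(y_hat - 0.5, sigma)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log(0)


def total_bits(latents, sigmas):
    """Sum of ideal code lengths over all latents."""
    return sum(bits_for_symbol(y, s) for y, s in zip(latents, sigmas))


# Hypothetical quantized latents: a flat region followed by an edge.
latents = [0, 0, 0, 0, 8, -7, 9, -8]

# One shared scale for every position (illustrative value).
fixed_sigmas = [4.0] * len(latents)

# Position-dependent scales, as side information might supply
# (illustrative values): small in the flat region, large at the edge.
adaptive_sigmas = [0.3] * 4 + [8.0] * 4

bits_fixed = total_bits(latents, fixed_sigmas)
bits_adaptive = total_bits(latents, adaptive_sigmas)
print(f"fixed scale:    {bits_fixed:.1f} bits")
print(f"adaptive scale: {bits_adaptive:.1f} bits")
```

In this toy setting the adaptive scales give a shorter total code length than the single shared scale. In a real codec the saving has to outweigh the bits spent on the side information.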

Paper Structure

This paper contains 15 sections, 23 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Left: representation of a transform coding model as a generative Bayesian model, and a corresponding variational inference model. Nodes represent random variables or parameters, and arrows indicate conditional dependence between them. Right: diagram showing the operational structure of the compression model. Arrows indicate the flow of data, and boxes represent transformations of the data. Boxes labeled $\mathcal{U}\mid Q$ represent either addition of uniform noise applied during training (producing vectors labeled with a tilde), or quantization and arithmetic coding/decoding during testing (producing vectors labeled with a hat).
  • Figure 2: Left: an image from the Kodak dataset. Middle left: visualization of a subset of the latent representation $\bm y$ of that image, learned by our factorized-prior model. Note that there is clearly visible structure around edges and textured regions, indicating that a dependency structure exists in the marginal which is not represented in the factorized prior. Middle right: standard deviations $\bm{\hat{\sigma}}$ of the latents as predicted by the model augmented with a hyperprior. Right: latents $\bm y$ divided elementwise by their standard deviation. Note how this reduces the apparent structure, indicating that the structure is captured by the new prior.
  • Figure 3: As in figure 1, but extended with a hyperprior.
  • Figure 4: Network architecture of the hyperprior model. The left side shows an image autoencoder architecture, the right side corresponds to the autoencoder implementing the hyperprior. The factorized-prior model uses the identical architecture for the analysis and synthesis transforms $g_a$ and $g_s$. Q represents quantization, and AE, AD represent arithmetic encoder and arithmetic decoder, respectively. Convolution parameters are denoted as: number of filters $\times$ kernel support height $\times$ kernel support width $/$ down- or upsampling stride, where $\uparrow$ indicates upsampling and $\downarrow$ downsampling. $N$ and $M$ were chosen dependent on $\lambda$, with $N=128$ and $M=192$ for the 5 lower values, and $N=192$ and $M=320$ for the 3 higher values.
  • Figure 5: Rate–distortion curves aggregated over the Kodak dataset. The top plot shows peak signal-to-noise ratios as a function of bit rate ($10 \log_{10} \frac{255^2}{d}$, with $d$ representing mean squared error), the bottom plot shows MS-SSIM values converted to decibels ($-10 \log_{10}(1-d)$, where $d$ is the MS-SSIM value in the range between zero and one). We observe that matching the training loss to the metric used for evaluation is crucial to optimize performance. Our hyperprior model trained on squared error outperforms all other ANN-based methods in terms of PSNR, and approximates HEVC performance. In terms of MS-SSIM, the hyperprior model consistently outperforms conventional codecs as well as Rippel and Bourdev (2017), the current state-of-the-art model for that metric. Note that the PSNR plot aggregates curves over equal values of $\lambda$, and the MS-SSIM plot aggregates over equal rates (with interpolation), in order to provide a fair comparison to both state-of-the-art methods. Refer to the full-page RD curves in the appendix (figure labels fig:kodak-psnr-full and fig:kodak-msssim-full), which include a wider range of compression methods.
  • ...and 9 more figures
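
The two decibel conversions quoted in the Figure 5 caption can be written out as a short sketch. The formulas are taken from the caption; the function names and the sample inputs are hypothetical and only illustrate the scale of the resulting values.

```python
import math


def psnr_db(mse, peak=255.0):
    """PSNR in dB from mean squared error: 10 * log10(peak^2 / mse)."""
    return 10.0 * math.log10(peak ** 2 / mse)


def msssim_db(msssim):
    """MS-SSIM in dB: -10 * log10(1 - d), for d in [0, 1)."""
    return -10.0 * math.log10(1.0 - msssim)


# Illustrative inputs, not results from the paper.
print(psnr_db(65.025))   # MSE of 65.025 on 8-bit images, about 30 dB
print(msssim_db(0.99))   # MS-SSIM of 0.99, about 20 dB
```

The logarithmic scale spreads out MS-SSIM values close to one, where differences between codecs would otherwise be hard to see on a linear axis.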