Towards Conceptual Compression

Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, Daan Wierstra

TL;DR

This work introduces convolutional DRAW, a recurrent variational auto-encoder that produces progressively abstract visual representations, separating global concepts from fine details. The model stacks latent variables and refines its output iteratively. Evaluated on Omniglot, CIFAR-10, and ImageNet, it achieves state-of-the-art likelihoods among latent variable models on Omniglot and ImageNet, and it demonstrates 'conceptual compression' by storing only high-level information. The authors also present compression-oriented techniques (arithmetic coding and bits-back coding) and analyze how information is distributed across layers and time steps, showing that high-level information is captured early and details are refined later. Overall, the method advances unsupervised, latent-variable image modeling and highlights practical routes to high-quality lossy compression that aligns with human perceptual judgments.

Abstract

We introduce a simple recurrent variational auto-encoder architecture that significantly improves image modeling. The system represents the state-of-the-art in latent variable models for both the ImageNet and Omniglot datasets. We show that it naturally separates global conceptual information from lower level details, thus addressing one of the fundamentally desired properties of unsupervised learning. Furthermore, the possibility of restricting ourselves to storing only global information about an image allows us to achieve high quality 'conceptual compression'.

Paper Structure

This paper contains 18 sections, 3 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Conceptual Compression [Omniglot]. The top row shows full reconstructions from the model. The subsequent rows were obtained by storing the first $t$ groups of latent variables and generating the remaining ones from the model ($t=1, 4, 7, 10, 13, 16, 19, 22, 25, 28$ are shown, out of a total of $30$ steps, from top to bottom). Each group of four columns shows different samples at a given compression level. We see that variations in later samples lie in small details, such as the precise placement of strokes. Reducing the number of stored bits tends to preserve the overall shape, but increases the symbol variation. Eventually a varied set of symbols is generated. Nevertheless, even in the first row there is a clear difference between variations produced from a given symbol and those between different symbols.
  • Figure 2: Conceptual Compression [ImageNet]. Analogous to Figure 1 but applied to natural images. Originals are placed on the bottom to compare more easily to the final reconstructions, which are nearly perfect. Here the latent variables were generated with zero variance. Iterations $t = 2, 4, 6, 8, 10, 14, 18, 25, 32$ of the model with $32$ steps are shown.
  • Figure 3: Schematic depiction of one time slice in convolutional DRAW. $X$ and $R$ denote input and reconstruction respectively.
  • Figure 4: Lossy Compression. Example images for various methods and levels of compression. Top block: original images. Each subsequent block has four rows corresponding to four methods of compression: JPEG, JPEG2000, convolutional DRAW with full prior variance for generation, and convolutional DRAW with zero prior variance. Each block corresponds to a different compression level; from top to bottom, the average numbers of bits per input dimension are 0.05, 0.1, 0.15, 0.2, 0.4, 0.8 (bits per image: 153, 307, 460, 614, 1228, 2457). In the first block, JPEG was left gray because it does not compress to this level. Images are of size $32\times32$. See the appendix for $64\times64$ images.
  • Figure 5: Generated samples for Omniglot.
  • ...and 8 more figures
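
The bits-per-image figures quoted in the Figure 4 caption can be reproduced from the bits-per-input-dimension rates with a short calculation. The sketch below is illustrative only and rests on two assumptions that the caption does not state: each $32\times32$ image has three colour channels (so $32 \cdot 32 \cdot 3 = 3072$ input dimensions), and the per-image budget is rounded down to a whole bit. The names `INPUT_DIMS` and `bits_per_image` are hypothetical.

```python
import math

# Assumption: 32x32 images with 3 colour channels (RGB); the caption gives
# only the spatial size, so the channel count is inferred, not stated.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
INPUT_DIMS = HEIGHT * WIDTH * CHANNELS  # number of input dimensions per image


def bits_per_image(bits_per_dim, input_dims=INPUT_DIMS):
    """Bit budget for one image, rounded down to a whole bit (assumed rounding)."""
    return math.floor(bits_per_dim * input_dims)


# Compression levels listed in the Figure 4 caption, in bits per input dimension.
rates = [0.05, 0.1, 0.15, 0.2, 0.4, 0.8]
budgets = [bits_per_image(r) for r in rates]

for rate, budget in zip(rates, budgets):
    print(f"{rate:.2f} bits/dim -> {budget} bits/image")
```

Under these assumptions the computed budgets agree with the per-image values listed in the caption, which suggests the rates are averaged over all pixels and channels rather than over pixels alone.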