Lossy Image Compression with Compressive Autoencoders
Lucas Theis, Wenzhe Shi, Andrew Cunningham, Ferenc Huszár
TL;DR
This paper addresses the need for flexible, high-performance lossy image compression by introducing compressive autoencoders (CAEs) trained end-to-end. It handles the non-differentiability of quantization by substituting a smooth surrogate for the gradient of the rounding step and by replacing the entropy term with a differentiable upper bound, which enables direct optimization of the rate-distortion objective $-\log_2 Q([f(\mathbf{x})]) + \beta\, d(\mathbf{x}, g([f(\mathbf{x})]))$, where $f$ is the encoder, $g$ the decoder, $Q$ a discrete probability model of the codes, $[\cdot]$ denotes rounding, $d$ measures distortion, and $\beta$ sets the trade-off between rate and distortion. The authors demonstrate performance competitive with JPEG 2000 on natural images, with better scores on perceptual measures such as SSIM and mean opinion scores (MOS), and achieve efficient decoding of large images through a sub-pixel architecture and entropy modeling based on Gaussian scale mixtures (GSMs). Further contributions are an incremental training procedure and scalable rate control via learnable scale parameters, which allow adaptation to multiple bitrates without retraining from scratch. The work lays a foundation for end-to-end, content-specific compression and suggests incorporating perceptual metrics or GAN-based enhancements as avenues for further improvement.
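To make the objective above concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a toy encoder and decoder (multiplication and division by a hypothetical `scale`), mean squared error for $d$, and a hypothetical two-sided geometric model for $Q$ in place of the GSMs used in the paper. The last function illustrates the gradient surrogate under the assumption that the derivative of rounding is replaced by that of the identity in the backward pass; the differentiable upper bound on the entropy term is not covered here.

```python
import math

def quantize(z):
    """Round each coefficient to the nearest integer (the [.] in the objective)."""
    return [math.floor(v + 0.5) for v in z]

def rate_bits(codes, p=0.5):
    """-log2 Q(codes) under a hypothetical factorized two-sided geometric model.

    Q(k) = (1 - p) / (1 + p) * p**abs(k), which sums to 1 over all integers k.
    This is an assumption for illustration, not the paper's GSM model.
    """
    log_norm = math.log2((1.0 - p) / (1.0 + p))
    return -sum(log_norm + abs(k) * math.log2(p) for k in codes)

def mse(x, x_hat):
    """Mean squared error, standing in for the distortion d."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def objective(x, scale=1.0, beta=1.0, p=0.5):
    """Rate plus beta times distortion for a toy encoder/decoder pair."""
    z = [scale * v for v in x]            # toy encoder f (assumption)
    codes = quantize(z)                   # [f(x)]
    x_hat = [k / scale for k in codes]    # toy decoder g (assumption)
    return rate_bits(codes, p) + beta * mse(x, x_hat)

def distortion_grad_wrt_latent(x, scale=1.0):
    """Gradient of the MSE with respect to the pre-rounding coefficients z.

    Rounding has zero derivative almost everywhere, so the backward pass
    treats it as the identity (d[z]/dz := 1). The gradient with respect to
    z is then the gradient with respect to the rounded codes.
    """
    codes = quantize([scale * v for v in x])
    n = len(x)
    return [-2.0 * (xi - k / scale) / (scale * n) for xi, k in zip(x, codes)]
```

Calling `objective([0.2, 1.7, -2.4], beta=10.0)` evaluates the trade-off for one toy input; raising `beta` weights distortion more heavily relative to the bit cost.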
Abstract
We propose a new approach to the problem of optimizing autoencoders for lossy image compression. New media formats, changing hardware technology, as well as diverse requirements and content types create a need for compression algorithms which are more flexible than existing codecs. Autoencoders have the potential to address this need, but are difficult to optimize directly due to the inherent non-differentiability of the compression loss. We here show that minimal changes to the loss are sufficient to train deep autoencoders competitive with JPEG 2000 and outperforming recently proposed approaches based on RNNs. Our network is furthermore computationally efficient thanks to a sub-pixel architecture, which makes it suitable for high-resolution images. This is in contrast to previous work on autoencoders for compression using coarser approximations, shallower architectures, computationally expensive methods, or focusing on small images.
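As a rough illustration of the sub-pixel idea mentioned in the abstract, the sketch below shows the rearrangement step that typically follows a convolution in a sub-pixel layer: a tensor with `C * r * r` channels at low resolution is reshuffled into `C` channels at `r` times the resolution, so the expensive convolutions run on the smaller grid. The function name `depth_to_space`, the nested-list layout, and the channel ordering are assumptions made for this example; the ordering used in the paper's implementation is not specified here.

```python
def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Uses one common channel ordering (an assumption):
    out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    """
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w]
    return out
```

For example, four 1x1 channels with `r = 2` become a single 2x2 channel, with each input channel supplying one pixel of the upsampled block.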
