Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
Onur G. Guleryuz, Philip A. Chou, Berivan Isik, Hugues Hoppe, Danhang Tang, Ruofei Du, Jonathan Taylor, Philip Davidson, Sean Fanello
TL;DR
The paper introduces a sandwich architecture that places a standard codec between neural pre- and post-processors, trained end-to-end via a differentiable codec proxy to optimize the rate-distortion objective $D(R)$. It demonstrates substantial RD gains across diverse image/video scenarios, including mismatched channels, high dynamic range, and non-RGB data, as well as with perceptual metrics such as LPIPS. Theoretical results bound the performance of the optimal sandwich and show how the inner codebook can be repurposed with a rate penalty $D(p\|q)$, while practical proxies enable training with JPEG/HEVC/AV1-like codecs. Empirically, the approach yields 6–9 dB gains for grayscale-to-color transport, 5–8 dB gains for HR-to-LR scenarios, up to 3 dB HDR gains, and 20–30% bitrate reductions on multi-channel textures, with LPIPS-driven gains reaching ~30% at equal perceptual quality. The work suggests that standard codecs can be made universal and adaptable to non-traditional content and distortion measures, enabling practical, scalable improvements for next-generation compression and streaming systems.
Abstract
We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec's performance on its intended content but, more importantly, adapts the codec to other types of image/video content and to other distortion measures. The sandwich learns to transmit ``neural code images'' that improve overall rate-distortion performance, with the improvements becoming especially significant when the overall problem is well outside the scope of the codec's design. We apply the sandwich architecture to standard codecs with mismatched sources that require transporting different numbers of channels, higher resolution, higher dynamic range, or computer graphics content, and with perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 30\% bitrate reductions) compared to alternative adaptations. We establish optimality properties for sandwiched compression and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, and sandwich configurations that offer interesting potential for video compression and streaming.
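To make the training objective concrete, the following is a minimal sketch of one forward pass of a rate-distortion loss through a sandwich. Everything in it is an illustrative assumption rather than the paper's implementation: the pre- and post-processors are reduced to per-pixel linear maps, the codec proxy is modeled as additive uniform noise of width `step`, the rate is estimated with a smooth logarithmic surrogate, and the names `preprocess`, `codec_proxy`, `rate_proxy`, `postprocess`, and `sandwich_loss` are hypothetical.

```python
import numpy as np

def preprocess(x, w_pre):
    # Hypothetical stand-in for the neural pre-processor: a per-pixel linear
    # map from source channels to the channels the inner codec accepts.
    return x @ w_pre

def codec_proxy(code, step, rng):
    # Assumed proxy: additive uniform noise of width `step` mimics the
    # quantization error of a real codec while remaining differentiable
    # with respect to `code`.
    noise = rng.uniform(-0.5 * step, 0.5 * step, size=code.shape)
    return code + noise

def rate_proxy(code, step):
    # Assumed smooth rate surrogate: larger coefficients relative to the
    # quantization step cost more bits.
    return np.sum(np.log2(1.0 + np.abs(code) / step))

def postprocess(y, w_post):
    # Hypothetical stand-in for the neural post-processor.
    return y @ w_post

def sandwich_loss(x, w_pre, w_post, step, lam, rng):
    # Lagrangian rate-distortion loss D + lam * R for one batch of pixels.
    code = preprocess(x, w_pre)              # "neural code image"
    decoded = codec_proxy(code, step, rng)   # pass through the codec proxy
    x_hat = postprocess(decoded, w_post)     # reconstruction
    distortion = np.mean((x - x_hat) ** 2)   # MSE distortion
    rate = rate_proxy(code, step) / x.shape[0]  # estimated bits per pixel
    return distortion + lam * rate, distortion, rate

# Example: 3-channel pixels carried through a 1-channel proxy.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
w_pre = rng.normal(size=(3, 1))
w_post = rng.normal(size=(1, 3))
loss, distortion, rate = sandwich_loss(x, w_pre, w_post, step=0.1, lam=0.01, rng=rng)
```

In an actual training setup, the two linear maps would be replaced by neural networks and the loss would be minimized with an automatic-differentiation framework; the sketch only shows how distortion and estimated rate combine into a single objective.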
