Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers

Onur G. Guleryuz, Philip A. Chou, Berivan Isik, Hugues Hoppe, Danhang Tang, Ruofei Du, Jonathan Taylor, Philip Davidson, Sean Fanello

TL;DR

The paper introduces a sandwich architecture that places a standard codec between neural pre- and post-processors, trained end-to-end via a differentiable codec proxy to optimize the rate-distortion objective $D(R)$. It demonstrates substantial RD gains across diverse image/video scenarios, including mismatched channels, high dynamic range, and non-RGB data, as well as with perceptual metrics such as LPIPS. Theoretical results bound the performance of the optimal sandwich and show how the inner codebook can be repurposed with a rate penalty $D(p||q)$, while practical proxies enable training with JPEG/HEVC/AV1-like codecs. Empirically, the approach yields 6–9 dB gains for grayscale-to-color transport, 5–8 dB gains for HR-to-LR scenarios, up to 3 dB HDR gains, and 20–30% bitrate reductions on multi-channel textures, with LPIPS-driven gains reaching ~30% at equal perceptual quality. The work suggests that standard codecs can be made universal and adaptable to non-traditional content and distortion measures, enabling practical, scalable improvements for next-generation compression and streaming systems.
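The rate penalty $D(p||q)$ mentioned above can be illustrated numerically, assuming it denotes the Kullback-Leibler divergence in its usual role: the extra bits per symbol paid when data distributed as $p$ is coded with an entropy code designed for $q$. The distributions below are made up for illustration and do not come from the paper.

```python
import math

def entropy_bits(p):
    """Shannon entropy of p in bits per symbol."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Expected code length (bits per symbol) when symbols follow p
    but the code is ideal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL divergence D(p||q) in bits: the rate penalty of the mismatch."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical symbol distributions (illustrative only).
p = [0.5, 0.25, 0.25]   # what the source actually emits
q = [0.25, 0.25, 0.5]   # what the fixed entropy coder was designed for

ideal = entropy_bits(p)               # 1.5 bits per symbol
mismatched = cross_entropy_bits(p, q) # 1.75 bits per symbol
penalty = kl_bits(p, q)               # 0.25 bits per symbol
```

Here the mismatched rate equals the ideal rate plus the penalty, which is the identity cross-entropy = entropy + KL divergence.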

Abstract

We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec's performance on its intended content, but more importantly, adapts the codec to other types of image/video content and to other distortion measures. The sandwich learns to transmit "neural code images" that optimize and improve overall rate-distortion performance, with the improvements becoming significant especially when the overall problem is well outside of the scope of the codec's design. We apply the sandwich architecture to standard codecs with mismatched sources transporting different numbers of channels, higher resolution, higher dynamic range, computer graphics, and with perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 30% bitrate reductions) compared to alternative adaptations. We establish optimality properties for sandwiched compression and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potential in video compression and streaming.
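As a rough illustration of the data flow described in the abstract, here is a minimal toy sketch, not the authors' implementation. It assumes a single scalar gain in place of each neural network, a uniform quantizer in place of the codec proxy, and a `log2(1 + |index|)` rate estimate; all function names and parameters are hypothetical.

```python
import math

def pre_process(x, gain):
    """Hypothetical stand-in for the neural pre-processor: one scalar gain."""
    return [gain * v for v in x]

def codec_proxy(b, step):
    """Toy stand-in for the codec: uniform scalar quantization.
    Returns the integer indices and the dequantized bottleneck."""
    idx = [round(v / step) for v in b]
    return idx, [i * step for i in idx]

def rate_proxy(idx):
    """Assumed rate estimate in bits: log2(1 + |index|) summed over samples."""
    return sum(math.log2(1 + abs(i)) for i in idx)

def post_process(b_hat, gain):
    """Hypothetical stand-in for the neural post-processor: undo the gain."""
    return [v / gain for v in b_hat]

def rd_loss(x, gain, step, lam):
    """Return (D + lam * R, D, R) for one example, with D as mean squared error."""
    idx, b_hat = codec_proxy(pre_process(x, gain), step)
    x_hat = post_process(b_hat, gain)
    dist = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    rate = rate_proxy(idx)
    return dist + lam * rate, dist, rate

# Illustrative input: a larger pre-processor gain spends more bits
# on the bottleneck and lowers the distortion.
x = [0.1, 0.4, -0.3, 0.8]
loss_lo, dist_lo, rate_lo = rd_loss(x, gain=1.0, step=1.0, lam=0.01)
loss_hi, dist_hi, rate_hi = rd_loss(x, gain=4.0, step=1.0, lam=0.01)
```

Even in this toy setting the wrapper parameters trade rate against distortion through a fixed codec, which is the trade-off the jointly trained pre- and post-processors are meant to optimize.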

Paper Structure

This paper contains 26 sections, 3 theorems, 21 equations, 27 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

[Optimal Sandwich] Let $X$ be an $\mathbb{R}^n$-valued bounded source, let $d$ be a distortion measure, and let $D(R)$ be the operational distortion-rate function for $X$ under $d$. For any $\epsilon>0$, let $(\alpha^*,\beta^*,\gamma^*)$ be the encoding, decoding, and lossless coding maps for a rate-

Figures (27)

  • Figure 1: The sandwich architecture can accomplish surprising results even with a simple codec (here JPEG 4:0:0, a single-channel grayscale codec). The neural pre-processor is able to encode the full RGB image in (a) into a grayscale image of neural codes in (b). The neural codes are low-frequency dither-like patterns that modulate the color information yet also survive JPEG compression (c). At the decoding end, the neural post-processor demodulates the patterns to faithfully recover the color while also achieving deblocking. The interested reader can generate an extensive set of further examples using our software at sandwich_oss.
  • Figure 2: Analogue of Figure 1 for video and HEVC. The sandwich is used to transport full-color video over a grayscale codec (HEVC 4:0:0). First, fifth, and tenth frames of compressed bottlenecks, final reconstructions by the post-processor, and original source videos are shown. Rate=0.07 bpp, PSNR=36.0 dB. The sandwich establishes temporally coherent modulation-like patterns on the bottlenecks, through which the pre-processor encodes color; the post-processor then demodulates them for a full-color result. The patterns are spatially broader than those in Figure 1 to facilitate more efficient motion compensation. The interested reader can generate an extensive set of further examples using our software at sandwich_oss.
  • Figure 3: Neural-sandwiched image codec during (a) operation and (b) training. Gray boxes are not differentiable; blue boxes are differentiable; green boxes are trainable. The training loss is $\sum_n (D_n+\lambda R_n)$ over example images $n$.
  • Figure 4: Neural pre-processor and post-processor.
  • Figure 5: Image codec proxy.
  • ...and 22 more figures
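
The training loss in the Figure 3 caption can be written out per example as follows. The symbols $f_\theta$ (pre-processor), $g_\phi$ (post-processor), $\mathrm{proxy}$ (differentiable codec proxy), and $R(\cdot)$ (rate estimate) are notation assumed here for illustration; only $D_n$, $R_n$, and $\lambda$ appear in the caption.

```latex
\begin{aligned}
b_n       &= f_\theta(x_n)              && \text{bottleneck image from the pre-processor} \\
\hat{b}_n &= \mathrm{proxy}(b_n)        && \text{differentiable stand-in for the codec} \\
\hat{x}_n &= g_\phi(\hat{b}_n)          && \text{reconstruction by the post-processor} \\
D_n       &= d(x_n, \hat{x}_n), \qquad R_n = R(b_n) \\
\mathcal{L}(\theta,\phi) &= \sum_n \bigl( D_n + \lambda R_n \bigr)
\end{aligned}
```

Under this reading, $\lambda$ sets the trade-off point on the rate-distortion curve, and gradients reach $\theta$ only through the proxy, which is why the non-differentiable codec is replaced during training.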

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Proposition 2: Stronger form of Proposition 1 (Optimal Sandwich)
  • Remark
  • Remark
  • proof
  • Proposition 3
  • proof