Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer

TL;DR

The paper addresses the challenge of achieving good image quality at low bit rates with fast decoding in learned image compression. It adapts the overfitted codec C3 by replacing the conventional distortion with Wasserstein Distortion ($\mathrm{WD}$) and by supplying common randomness ($\mathrm{CR}$) to the decoder, while keeping decoding cost very low. In a large human rater study, the WD-based approach achieves a perceptual quality vs. bit rate trade-off comparable to generative codecs such as HiFiC while requiring less than 1% of the multiply-accumulate operations (MACs) for decoding, and WD correlates with human ratings (Elo scores) at a Pearson coefficient of about $0.94$. The findings suggest that modeling perceptual texture via WD can achieve good, cheap, and fast compression without fully modeling the data distribution, albeit with some encoding-time cost and ad hoc design choices for the $\sigma$-map that warrant further refinement.

Abstract

Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study, showing that WD clearly outperforms LPIPS as an optimization objective. The study also reveals that WD outperforms other perceptual metrics such as LPIPS, DISTS, and MS-SSIM as a predictor of human ratings, remarkably achieving over 94% Pearson correlation with Elo scores.

Paper Structure

This paper contains 7 sections, 4 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Crop of image 28 from the CLIC2020 professional dataset (best viewed on screen). Top left: Original image. Top right: C3 optimized for MSE, compressed to 0.172 bits/pixel. Bottom left: C3 optimized for WD with $\sigma=8$, compressed to 0.167 bits/pixel. Bottom right: C3 optimized for WD with $\sigma=8$, without common randomness, compressed to 0.180 bits/pixel. While optimization for MSE leads to flattened texture, as seen in the reproduction of the grass and the walls of the building, texture is vastly improved in the WD-optimized versions. Not providing common randomness to the decoder (bottom right) means the codec must reproduce all textures using deterministic structure such as straight lines. In the texture of the roof, for example, the lines are not consistent with the original, yet they approximate the texture better than the MSE version. Deterministic structure is not adequate for random textures such as the grass in front of the building, where providing common randomness significantly helps to maintain the visual quality of texture.
  • Figure 2: Decoding an image with COOL-CHIC [ladune2023cool] and C3 [C3_2024]. A. A latent element $\widehat{\mathbf{z}}^n_{ij}$ is autoregressively decoded by applying the entropy network $g_\psi$ to the spatial context $\mathrm{c}(\mathbf{z}^n; (i, j))$, yielding parameters $\mu$, $\sigma$ of the Laplacian probability model used for entropy decoding the latent element. B. The decoded latent spatial arrays at multiple resolutions are first bilinearly upsampled to the target resolution and then transformed into pixel space using the synthesis network $f_\theta$. We supply common randomness at multiple resolutions by drawing i.i.d. elements from a pseudo-random number generator with fixed seed 42, which is novel to the present work. The common randomness arrays are upsampled and concatenated with the upsampled latents. Figure adapted with permission from [C3_2024].
  • Figure 3: Computation of Wasserstein Distortion (WD) between two images, here denoted $x$ and $\hat{x}$. A. We extract spatial feature maps $f_i$ from selected layers of a VGG network. B. For each feature $i$, we estimate local first and second raw moments at multiple scales $\alpha$ by successively applying a linear downsampling operation $\mathcal{D}$: $\mu_{i\alpha} = \mathcal{D}^\alpha f_i$ and $\rho_{i\alpha} = \mathcal{D}^\alpha f_i^2$; the local standard deviations are derived by taking $\nu_{i\alpha} = \sqrt{\rho_{i\alpha} - \mu_{i\alpha}^2}$ elementwise. Note that $\mu_{i0} = f_i$ and $\nu_{i0} = 0$. C. The local WD values for feature $i$ and scale $\alpha$ are computed elementwise as $d_{i\alpha}^{x\hat{x}} = \sqrt{(\mu^{x}_{i\alpha} - \mu^{\hat{x}}_{i\alpha})^2 + (\nu^{x}_{i\alpha} - \nu^{\hat{x}}_{i\alpha})^2}$. They are then spatially averaged, with weights $w_{i\alpha}$ derived from the $\sigma$-map, yielding the scalar WD for feature $i$, $d_{i}^{x\hat{x}}$. The total WD for the image pair is obtained by adding the contributions from all features.
  • Figure 4: Human rating study results and decoder complexity. Left: Evaluation of image compression methods in terms of visual fidelity vs. bit rate. Error bars indicate 99th percentile. Squares indicate C3 using CR, circles indicate C3 without CR, and diamonds mark other compression methods. Right: Computational complexity of the decoder of the same methods for the middle bit rate. HiFiC slightly outperforms C3 in terms of visual quality vs. bit rate. However, it requires more than two orders of magnitude more computations at the decoder. Note that neural methods tend to have similar complexity across different bit rates and objectives. This is also true for C3; only the addition of CR increases the complexity slightly. The complexity of VVC can vary across bit rates, and can't strictly be shown in the same plot, since it doesn't use floating point operations; we plot an estimated equivalent on typical hardware [debargha].
  • Figure 5: Human rating study results for alternative perceptual objectives, including WD without saliency. Error bars indicate 99th percentile. As loss functions, both MS-SSIM and LPIPS lead to instabilities. To achieve reasonable results, we clipped the gradients of the exponentiation operations in MS-SSIM (C3/MS-SSIM), and used a loss equally weighting LPIPS and MSE (C3/LPIPS). We also assessed the effect of weighting MSE using saliency (C3/wMSE).
  • ...and 7 more figures
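
The moment-based computation described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, $\mathcal{D}$ is assumed to be 2x2 average pooling, the spatial average uses uniform weights rather than weights derived from the $\sigma$-map, and a raw array stands in for a VGG feature map.

```python
import math

def downsample(a):
    # Assumed stand-in for the linear operator D: 2x2 average pooling.
    h, w = len(a) // 2, len(a[0]) // 2
    return [[(a[2*i][2*j] + a[2*i][2*j+1] + a[2*i+1][2*j] + a[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def local_moments(f, alpha):
    # mu = D^alpha f, rho = D^alpha f^2, nu = sqrt(rho - mu^2), all elementwise.
    mu = [[float(v) for v in row] for row in f]
    rho = [[v * v for v in row] for row in mu]
    for _ in range(alpha):
        mu, rho = downsample(mu), downsample(rho)
    # Clamp at zero to guard against tiny negative values from rounding.
    nu = [[math.sqrt(max(r - m * m, 0.0)) for r, m in zip(rho_row, mu_row)]
          for rho_row, mu_row in zip(rho, mu)]
    return mu, nu

def mean_local_wd(f_ref, f_rec, alpha):
    # Elementwise d = sqrt((mu - mu')^2 + (nu - nu')^2), then a uniform
    # spatial average (assumption: the paper weights this by the sigma-map).
    mu_a, nu_a = local_moments(f_ref, alpha)
    mu_b, nu_b = local_moments(f_rec, alpha)
    d = [math.hypot(ma - mb, na - nb)
         for mu_ra, mu_rb, nu_ra, nu_rb in zip(mu_a, mu_b, nu_a, nu_b)
         for ma, mb, na, nb in zip(mu_ra, mu_rb, nu_ra, nu_rb)]
    return sum(d) / len(d)

# A checkerboard and its one-pixel shift: different pixels, same texture.
checker = [[(i + j) % 2 for j in range(4)] for i in range(4)]
shifted = [[(i + j + 1) % 2 for j in range(4)] for i in range(4)]

print(mean_local_wd(checker, shifted, alpha=0))  # 1.0
print(mean_local_wd(checker, shifted, alpha=1))  # 0.0
```

At scale $\alpha = 0$ the standard deviations vanish and the measure reduces to a pixelwise difference, so the two patterns are maximally different. At $\alpha = 1$ both patterns have identical local means and standard deviations, so the distortion is zero, which is the sense in which WD compares texture statistics rather than exact pixel values.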