Region-Adaptive Generative Compression with Spatially Varying Diffusion Models

Lucas Relic, Roberto Azevedo, Yang Zhang, Stephan Mandt, Markus Gross, Christopher Schroers

Abstract

Generative image codecs aim to optimize perceptual quality, producing realistic and detailed reconstructions. However, they often overlook a key property of human vision: our tendency to focus on particular aspects of a visual scene (e.g., salient objects) while giving less importance to other regions. An ideal perceptual codec should be able to exploit this property by allocating more representational capacity to perceptually important areas. To this end, we propose a region-adaptive diffusion-based image codec that supports non-uniform bit allocation within an image. We design a novel spatially varying diffusion model capable of denoising varying amounts of noise per pixel according to arbitrary importance maps. We further identify that these maps can serve as effective priors on the latent representation, and integrate them into our entropy model, improving rate-distortion performance. Built on these contributions, our spatially-adaptive diffusion-based codec outperforms state-of-the-art ROI-controllable baselines in both full-image and ROI-masked perceptual quality.
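
To make the core mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how a latent can be noised by a different amount per pixel according to an importance map, using the standard closed-form DDPM forward process with a per-pixel timestep. The function names, the linear ROI-to-timestep mapping, and the noise schedule are assumptions for illustration only.

```python
import torch

def spatially_varying_noise(y0, t_map, alpha_bar):
    """Noise each pixel of a latent according to its own timestep.

    y0:        clean latent, shape (B, C, H, W)
    t_map:     integer timestep per spatial position, shape (B, 1, H, W)
    alpha_bar: cumulative products of (1 - beta_t), shape (T,)
    """
    a = alpha_bar[t_map]                    # per-pixel schedule value, (B, 1, H, W)
    eps = torch.randn_like(y0)
    # Standard forward process q(y_t | y_0), applied pixel-wise.
    return a.sqrt() * y0 + (1.0 - a).sqrt() * eps

# Toy usage with a linear importance -> timestep mapping (an assumption,
# not necessarily the mapping used in the paper).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

y0 = torch.randn(1, 4, 32, 32)                 # e.g. a VAE latent
roi = torch.rand(1, 1, 32, 32)                 # importance in [0, 1]
t_map = ((1.0 - roi) * (T - 1)).long()         # important pixels get small t (little noise)
y_t = spatially_varying_noise(y0, t_map, alpha_bar)
```

Regions assigned a small timestep keep most of their content and can be reconstructed faithfully, while regions assigned a large timestep are noised more heavily and are correspondingly more synthesized by the diffusion decoder.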

Figures (18)

  • Figure 1: High-level illustration of our region-aware generative image compression framework. Based on the region of interest, our method adaptively allocates bits across different areas of the image. Regions of interest receive more representational capacity and are thus reconstructed more accurately. Content in less important areas is, as expected, more heavily synthesized, but it still integrates seamlessly into the scene while looking as realistic as possible.
  • Figure 2: Pipeline of our proposed method. An image $\mathbf{x}$ and an ROI map $\mathbf{t}$ are taken as input, and the former is encoded into latent space with a VAE encoder. The image latent $\mathbf{y}_{\mathbf{t}}$ is spatially adaptively quantized according to the ROI map, adding noise in the process. $\mathbf{y}_{\mathbf{t}}$ and $\mathbf{t}$ are then written to the bitstream using our proposed timestep conditioned entropy model. At the receiver, $\mathbf{y}_{\mathbf{t}}$ and $\mathbf{t}$ are processed with our spatially adaptive diffusion model, including timestep resampling, which reconstructs the source latent. The latent is then decoded with a VAE decoder to produce the final reconstruction $\hat{\mathbf{x}}$.
  • Figure 3: (a) Our timestep resampling process. For each unique noise level of the pixels in $\hat{\mathbf{y}}_{\mathbf{t}}$ (denoted by different colors), we adjust the amount of noise removed per step such that all pixels are fully denoised in the same number of diffusion forward evaluations (a minimal sketch of this resampling appears after this figure list). (b) Examples of the spatial $\mathbf{t}$ maps we sample during training. The region shape, number of regions, and timestep value of each region are randomly sampled.
  • Figure 4: Our Timestep Conditioned Entropy Model architecture. The ROI map $\mathbf{t}$ is first losslessly encoded and transmitted to the receiver. It is then used as prior knowledge for entropy coding, along with hierarchical and channel-slice context, by predicting Gaussian parameters that define the distribution of $\mathbf{y}_{\mathbf{t}}$ (a simplified sketch of this entropy model appears after this figure list).
  • Figure 5: Quantitative rate-distortion comparison of spatially adaptive baselines on the Kodak and CLIC 2022 datasets. Performance is measured by full-image (solid line) and in-ROI (dotted line) LPIPS, and our method performs best on both metrics. UDDQ-ROI is competitive at high bitrates but degrades quickly as the bitrate decreases, while the other baselines do not show strong performance on perceptual metrics.
  • ...and 13 more figures
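
Below is a minimal sketch of one way to realize the timestep resampling illustrated in Figure 3(a): every pixel starts from its own noise level but follows a schedule scaled so that all pixels reach $t = 0$ after the same number of forward evaluations. The linear spacing and function name are assumptions; the paper's exact schedule may differ.

```python
import torch

def resample_timesteps(t_map, num_steps):
    """Per-pixel denoising schedule with a shared number of steps.

    t_map:     starting timestep per pixel, shape (B, 1, H, W)
    num_steps: number of diffusion forward evaluations at the receiver

    Returns a tensor of shape (num_steps, B, 1, H, W) where schedule[k] is
    the timestep each pixel sits at before denoising step k. Pixels that
    start at a higher noise level take proportionally larger jumps, so all
    pixels are fully denoised after exactly num_steps evaluations.
    """
    fracs = torch.arange(num_steps, 0, -1, dtype=torch.float32) / num_steps
    schedule = fracs.view(-1, 1, 1, 1, 1) * t_map.float()
    return schedule.round().long()

# Toy usage: ROI pixels start at t=200, background pixels at t=800.
t_map = torch.tensor([[[[200, 800]]]])
print(resample_timesteps(t_map, num_steps=4).squeeze())
# -> [[200, 800], [150, 600], [100, 400], [50, 200]]
```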
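
The timestep conditioned entropy model of Figure 4 can be pictured roughly as follows: the ROI map, after being transmitted losslessly, conditions a small network that predicts per-element Gaussian parameters for the latent, and the rate is the negative log-likelihood of the quantized latent under those Gaussians. The network architecture, the normalization of $\mathbf{t}$, and the omission of the hierarchical and channel-slice context are simplifying assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class ToyTimestepConditionedPrior(nn.Module):
    """Simplified stand-in for the entropy model in Figure 4: predicts
    per-element Gaussian parameters for y_t from the ROI map alone
    (the paper additionally uses hierarchical and channel-slice context)."""

    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * latent_channels, 3, padding=1),
        )

    def forward(self, t_map, max_t=1000.0):
        # Normalize the timestep map before feeding it to the network.
        mu, log_sigma = self.net(t_map.float() / max_t).chunk(2, dim=1)
        return mu, log_sigma.exp().clamp(min=1e-6)

def rate_bits(y_t, mu, sigma):
    """Estimated rate: -log2 of the probability mass of each quantization
    bin [y - 0.5, y + 0.5] under the predicted Gaussian, summed over y_t."""
    dist = torch.distributions.Normal(mu, sigma)
    p = dist.cdf(y_t + 0.5) - dist.cdf(y_t - 0.5)
    return -torch.log2(p.clamp(min=1e-9)).sum()

# Toy usage (random values, shapes only):
t_map = torch.randint(0, 1000, (1, 1, 32, 32))
y_t = torch.randn(1, 4, 32, 32).round()        # "quantized" latent
prior = ToyTimestepConditionedPrior()
mu, sigma = prior(t_map)
print(f"estimated rate: {rate_bits(y_t, mu, sigma):.1f} bits")
```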