Super-High-Fidelity Image Compression via Hierarchical-ROI and Adaptive Quantization

Jixiang Luo, Yan Wang, Hongwei Qin

TL;DR

This work tackles blur and deformation in learned image compression (LIC) at very low bitrates by combining Hierarchical-ROI (H-ROI), which allocates bits across multiple foreground regions and a background, with channel-wise non-linear adaptive quantization, which controls the bitrate. Built on an ELIC-based architecture, the method optimizes a rate-distortion objective using saliency-driven ROI masks, GAN and perceptual losses, and progressive decoding across ROI layers. Empirical results show substantial LPIPS improvements and significant bit-rate reductions relative to BPG and HiFiC, with the most pronounced gains on small faces and text, while preserving PSNR/MS-SSIM. The approach demonstrates that content-aware ROI masking and non-linear, channel-wise quantization can improve LIC quality in the low-bitrate regime, with practical gains for visual quality and potential applications in coding for machines.

Abstract

Learned Image Compression (LIC) has achieved dramatic progress on both objective and subjective metrics. MSE-based models aim to improve objective metrics, while generative models are leveraged to improve visual quality as measured by subjective metrics. However, both suffer from blurring or deformation at low bit rates, especially below $0.2$ bpp. Moreover, deformation of human faces and text is unacceptable for visual quality assessment, and the problem becomes more prominent for small faces and text. To solve this problem, we combine the advantages of MSE-based and generative models by utilizing regions of interest (ROI). We propose Hierarchical-ROI (H-ROI), which splits an image into several foreground regions and one background region to improve the reconstruction of regions containing faces, text, and complex textures. Further, we propose adaptive quantization via a non-linear mapping within the channel dimension to constrain the bit rate while maintaining visual quality. Extensive experiments demonstrate that our methods achieve better visual quality on small faces and text at lower bit rates, e.g., $0.7\times$ the bits of HiFiC and $0.5\times$ the bits of BPG.
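
To make the H-ROI idea concrete, the following is a minimal sketch of how nested foreground masks could be derived from a saliency map and used to measure a separate MSE per foreground level. It is an illustration under stated assumptions, not the authors' implementation: the function names `hierarchical_masks` and `per_region_mse`, the fixed thresholds, and the use of simple thresholding in place of the paper's salient object detection network are all hypothetical.

```python
import numpy as np

def hierarchical_masks(saliency, thresholds=(0.3, 0.6, 0.9)):
    """Build nested binary foreground masks from a saliency map in [0, 1].

    Assumption: each level keeps the pixels whose saliency reaches that
    level's threshold, so higher levels are subsets of lower ones. The
    threshold values are illustrative only.
    """
    return [(saliency >= t).astype(np.float64) for t in thresholds]

def per_region_mse(x, x_hat, masks):
    """Return one MSE value per foreground mask (0.0 for an empty mask)."""
    err = (x - x_hat) ** 2
    values = []
    for m in masks:
        n = m.sum()
        values.append(float((m * err).sum() / n) if n > 0 else 0.0)
    return values

# Toy 2x2 single-channel example.
saliency = np.array([[0.1, 0.5], [0.7, 0.95]])
x = np.zeros((2, 2))
x_hat = np.array([[1.0, 1.0], [1.0, 2.0]])
masks = hierarchical_masks(saliency)
mse_levels = per_region_mse(x, x_hat, masks)
```

In a training setup, per-level values like these could be weighted more heavily for the most salient regions, so that faces and text receive a larger share of the distortion budget than the background.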

Paper Structure (14 sections, 10 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: The visual quality of Kodim14 with H-ROI vs. BPG and HiFiC. Our method shows higher fidelity for human faces and text on the boat at a smaller bpp.
  • Figure 2: Hierarchical-ROI with a salient object detection network. $I$ is the original image. $F_{i}$ and $B_{i}$, $i=1,2,3$, represent the foreground and background of the $i$-th layer. The right column visualizes the salient objects in yellow.
  • Figure 3: Diagram of the network adopted. The right part is ELIC [elic]. We use the same architecture for $g_a, g_s, h_a$ and $h_s$ as the original paper. Context denotes the spatial-channel context model described in ELIC. Q and AQ are quantization and adaptive quantization. AE and AD are arithmetic encoding and decoding. The left part shows the adversarial training $g_d$, which has the same discriminator structure as HiFiC [mentzer2020high], and the perceptual learning $g_v$, which we train with a VGG network [simonyan2014very] and $l_1$ loss. We use MSE losses $mse_{i}, i=0,1,2$ for foregrounds at different levels.
  • Figure 4: The PCSA network used in H-ROI, simplified from [gu2020PCSA]. MobileNetV3 [howard2019searching] is used to extract low-dimensional and high-dimensional features. Conv16-1x1 represents a convolutional layer with a $1 \times 1$ kernel and $16$ output channels, while DConv8-3x3 denotes a dilated convolutional layer with dilation $3$ and $8$ output channels. BatchNorm and PReLU are the normalization and activation layers. interpolate denotes bilinear upsampling.
  • Figure 5: The influence of quantization at different layers. $layer_1$ means no adaptive quantization; $layer_2, layer_3, layer_4$ mean that $\epsilon_1, \epsilon_2, \epsilon_3$, respectively, are applied for quantization.
  • ...and 9 more figures
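
The adaptive quantization mentioned in the abstract and in the Figure 5 caption can be illustrated with a small sketch of channel-wise quantization, where each latent channel is rounded with its own step size. The exact non-linear mapping and the precise role of $\epsilon$ are not given in this section, so everything below is an assumption: the names `channel_step_sizes` and `adaptive_quantize`, the power-law ramp, and the `gamma` parameter are hypothetical and chosen only to show the mechanism.

```python
import numpy as np

def channel_step_sizes(num_channels, eps, gamma=2.0):
    """Hypothetical non-linear map from channel index to quantization step.

    Assumption: the step grows from 1 (plain rounding) at the first channel
    to 1 + eps at the last one, following a power law with exponent gamma.
    """
    c = np.arange(num_channels) / max(num_channels - 1, 1)
    return 1.0 + eps * c ** gamma

def adaptive_quantize(y, steps):
    """Quantize a latent of shape (C, H, W) with one step size per channel."""
    s = steps[:, None, None]          # broadcast over spatial dimensions
    return np.round(y / s) * s

# Toy latent with 3 channels and a single spatial position.
y = np.full((3, 1, 1), 1.4)
steps = channel_step_sizes(3, eps=1.0)
y_hat = adaptive_quantize(y, steps)
```

With `eps=0` every step equals 1 and the function reduces to ordinary rounding, which corresponds to the "no adaptive quantization" case; larger steps coarsen the affected channels and so lower the number of bits they need.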