Region-Adaptive Transform with Segmentation Prior for Image Compression

Yuxi Liu, Wenhan Yang, Huihui Bai, Yunchao Wei, Yao Zhao

TL;DR

This work introduces class-agnostic segmentation masks for extracting region-adaptive contextual information. According to the authors, it is the first to employ class-agnostic masks as privilege information while achieving superior performance in pixel-fidelity metrics such as Peak Signal-to-Noise Ratio (PSNR).

Abstract

Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transform methods for compression. However, no prior research has studied neural transforms that focus on specific regions. In response, we introduce class-agnostic segmentation masks (i.e., semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions to different regions, guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While prior image compression efforts have used segmentation masks as additional intermediate inputs, our approach differs significantly from them. Its key advantage is that, to avoid extra bitrate overhead, we treat these masks as privilege information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privilege information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal-to-Noise Ratio (PSNR). The experimental results demonstrate our improvement over previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The source code is available at https://github.com/GityuxiLiu/SegPIC-for-Image-Compression.
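Since the abstract reports results in terms of PSNR, a minimal sketch of that metric may help as a reference. It uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), and is not taken from the authors' repository; the function name `psnr`, the flat-list input format, and the default peak value of 255 (8-bit images) are assumptions made for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Return PSNR in dB between two equally sized flat sequences of pixel values.

    Hypothetical helper for illustration; max_value=255 assumes 8-bit images.
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [10, 20, 30, 40]
decoded = [11, 21, 31, 41]  # every pixel is off by one, so MSE = 1
print(round(psnr(original, decoded), 2))
```

Higher PSNR at the same bitrate means the decoded image is closer to the original, which is the sense in which the abstract speaks of pixel fidelity.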

Paper Structure

This paper contains 15 sections, 8 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Paradigms of end-to-end learned image compression (LIC) methods. (a) Conventional LIC with fixed learned kernels. (b) Data-dependent LIC with image-specific kernels wang2022neural. (c) The proposed method is capable of generating region-adaptive kernels from different regions, providing more fine-grained transforms.
  • Figure 2: (a) Overall framework of our SegPIC. The blue flow lines represent the decoding process on the decoder side. The dashed lines represent the process of extracting the prototypes $p',p$ (see Eq. \ref{eq.map}). For simplicity, the compression of $p',p$ is not fully depicted; please refer to Fig. 4 for details. $N$ is set to 192, and $M$ is 320 as in stf2022. RAT is the proposed Region-Adaptive Transform. SAL is the proposed Scale Affine Layer. WAM is the Window Attention Module stf2022. ChARM is the Channel-wise Auto-Regressive Model minnen2020channel. FM is a Factorized Model balle2018. GDN is Generalized Divisive Normalization balle2016. (b) is the Downsample Block. "Downsample, N" in (a) means the block with out-channel N. "Conv C,5,2" means out-channel C, kernel size 5, stride 2. (c) is the Upsample Block, and "TConv" is Transposed Convolution.
  • Figure 3: The diagram of extracting the prototypes and the proposed RAT module. DPSConv is the proposed Depth&Point-wise Separable Convolution (see Eq. \ref{eq.dpsconv}). CTL is the Channel Transform Layers, DKG is the DPSConv Kernel Generator, and CAG is the Channel Attention Generator. The detailed modules are shown in Fig. 5.
  • Figure 4: The architecture of the Prototype Encoder and Decoder for compressing and transmitting the prototypes. "Linear $\text{C}_1$,$\text{C}_2$" means linear layer with in-channel $\text{C}_1$ and out-channel $\text{C}_2$.
  • Figure 5: The detailed modules in RAT, including the Channel Transform Layers (CTL), DPSConv Kernel Generator (DKG), and Channel Attention Generator (CAG). LReLU is Leaky ReLU. GAP is Global Average Pooling. "GConv $\text{k}^2\text{C}$,3,1,$\text{C}$" means Grouped Convolution with out-channel $\text{k}^2\text{C}$, kernel size 3, stride 1, and $\text{C}$ groups.
  • ...and 6 more figures
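
The captions above refer to "prototypes" extracted from regions under the guidance of the masks, but this page does not reproduce the referenced equation. As a rough illustration only, the sketch below assumes a prototype is the per-channel mean of a feature map over the pixels of one mask region (masked average pooling). That choice, the function name `masked_average_pooling`, and the nested-list data layout are assumptions for this example, not the authors' confirmed implementation.

```python
def masked_average_pooling(features, mask, num_regions):
    """Average a C x H x W feature map over each region of an H x W index mask.

    Hypothetical sketch: `features` is a nested list [C][H][W], and `mask`
    holds integer region ids in [0, num_regions). Returns one prototype per
    region, each a list of C channel means.
    """
    channels = len(features)
    sums = [[0.0] * channels for _ in range(num_regions)]
    counts = [0] * num_regions
    for y, row in enumerate(mask):
        for x, region in enumerate(row):
            counts[region] += 1
            for c in range(channels):
                sums[region][c] += features[c][y][x]
    # Regions with no pixels keep an all-zero prototype (arbitrary choice here).
    return [
        [s / counts[r] if counts[r] else 0.0 for s in sums[r]]
        for r in range(num_regions)
    ]


# Two channels on a 2 x 2 grid; the top row is region 0, the bottom row is region 1.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
mask = [[0, 0], [1, 1]]
prototypes = masked_average_pooling(features, mask, num_regions=2)
print(prototypes)  # [[1.5, 15.0], [3.5, 35.0]]
```

Under this assumption, each region yields one compact vector, which a module such as the DKG or CAG could then map to region-specific kernels or channel weights.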