Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression

Huairui Wang, Nianxiang Fu, Zhenzhong Chen, Shan Liu

TL;DR

The paper tackles fixed-range spatial aggregation in learned image compression by introducing dynamic kernel-based adaptive aggregation, bridging CNN-based transforms and Transformer-style attention. It develops a Dynamic Residual Block Group that uses Lite DCN to generate content-conditioned offsets and shared weights, enabling adaptive receptive fields with controlled complexity. It also proposes a generalized coarse-to-fine entropy model, which incorporates a dynamic hyper-prior for a more expressive global context and an Asymmetric Spatial-channel Entropy Model to reduce redundancy across latents. Empirical results on Kodak, CLIC, and Tecnick show that the resulting framework, DKIC, outperforms traditional codecs (BPG, VTM-12.1) and recent learned image compression (LIC) methods in rate-distortion performance, with favorable model size and inference speed.

Abstract

Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregates spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in a content-conditioned range to aid the transform. With the adaptive aggregation strategy and the weight-sharing mechanism, our method achieves promising transform capability with acceptable model complexity. In addition, building on recent progress in entropy models, we define a generalized coarse-to-fine entropy model that considers the coarse global context, the channel-wise context, and the spatial context. Based on it, we introduce a dynamic kernel in the hyper-prior to generate a more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model based on an investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to state-of-the-art learning-based methods.
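
The offset-based aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's Lite DCN: the function names `bilinear_sample` and `dynamic_aggregate` are hypothetical, the input is a single-channel map, the offsets are passed in directly rather than predicted from the content by a network, and the nine kernel weights are simply shared across all positions. It only shows how per-position offsets move the sampling points of a 3x3 kernel away from the fixed grid.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample a 2-D map `feat` (H, W) at fractional positions (ys, xs).

    Positions that fall outside the map contribute zero (zero padding).
    """
    H, W = feat.shape
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    out = np.zeros(ys.shape, dtype=float)
    # Accumulate the four integer neighbours with bilinear weights.
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            wy = 1.0 - np.abs(ys - yi)
            wx = 1.0 - np.abs(xs - xi)
            valid = (yi >= 0) & (yi < H) & (xi >= 0) & (xi < W)
            vals = feat[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
            out += np.where(valid, wy * wx * vals, 0.0)
    return out

def dynamic_aggregate(feat, offsets, weights):
    """Aggregate a 3x3 neighbourhood whose sampling points are shifted.

    feat:    (H, W) feature map.
    offsets: (H, W, 9, 2) per-position (dy, dx) shifts. Assumption: supplied
             by the caller here; in a learned model a small network would
             predict them from the content.
    weights: (9,) kernel weights shared by every position (assumption).
    """
    H, W = feat.shape
    # Regular 3x3 grid in row-major order; index 4 is the centre tap.
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    out = np.zeros((H, W), dtype=float)
    for k, (by, bx) in enumerate(base):
        ys = yy + by + offsets[..., k, 0]
        xs = xx + bx + offsets[..., k, 1]
        out += weights[k] * bilinear_sample(feat, ys, xs)
    return out
```

With all offsets set to zero, `dynamic_aggregate` reduces to an ordinary zero-padded 3x3 filter, which is the fixed-range aggregation the abstract contrasts against; non-zero offsets let each position draw on a different, content-dependent set of locations.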

Paper Structure (28 sections, 6 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Characteristic differences between the ordinary kernel, window-based attention, and our dynamic kernel. For the visualization of the effective receptive fields, we choose Cheng2020 [cheng2020learned] and STF [zou2022devil] as the representative methods using ordinary kernel and window-based attention, respectively. The red point in (a) denotes the target point.
  • Figure 2: Overview of our proposed image compression framework DKIC. We use $g_a$ to transform the input $x$ into the latent representation $y$, and propose the Asymmetric Spatial-channel Entropy Model to estimate the distribution parameters ($\mu$ and $\sigma$) of $y$. Following [minnen2018joint], we quantize and compress $y-\mu$ to the bitstream. After entropy decoding, we restore the image $\hat{x}$ from $\hat{y}$ with the inverse transform network $g_s$.
  • Figure 3: Visualization of dynamic sampling locations. The left figures show kodim7 with the red target aggregation point, and the right figures show the sampling locations of the dynamic kernel. Different colors denote the different groups to which the points belong.
  • Figure 4: Visualization of the average value of the unevenly grouped latent feature. It can be seen from the figures that the earlier-coded slices have larger symbol magnitudes and stronger spatial correlation in the neighborhood.
  • Figure 5: Description of the Asymmetric Spatial-channel Entropy Model. We split the latents $y$ into 5 slices. Every slice has global context $gc$ from the hyper-prior. In addition, considering the different spatial correlation in each slice, we use the 4-stage spatial context model to estimate the distribution parameters of the first two slices, and we adopt the 2-stage spatial context model for the subsequent slices. The subscript and superscript of $\boldsymbol{y}$ denote the sub-part of the latent in the channel and spatial dimensions, respectively.
  • ...and 4 more figures
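
The asymmetric schedule described in the Figure 5 caption (5 slices, a 4-stage spatial context model for the first two, a 2-stage model for the rest) can be sketched as follows. The slice and stage counts come from the caption; the spatial patterns are assumptions made for illustration only, namely a checkerboard for the 2-stage split and a 2x2 interleave for the 4-stage split, and the names `spatial_stage_masks` and `coding_schedule` are hypothetical. The caption does not specify which patterns the paper actually uses.

```python
import numpy as np

def spatial_stage_masks(h, w, stages):
    """Return `stages` boolean masks that partition an (h, w) grid.

    Assumed patterns (not taken from the source): a checkerboard for
    2 stages and a 2x2 interleave for 4 stages.
    """
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    if stages == 2:
        idx = (yy + xx) % 2
    elif stages == 4:
        idx = 2 * (yy % 2) + (xx % 2)
    else:
        raise ValueError("only 2 or 4 stages are sketched here")
    return [idx == s for s in range(stages)]

def coding_schedule(num_slices=5, fine_slices=2):
    """List (slice, stage) pairs in coding order.

    The first `fine_slices` channel slices use 4 spatial stages and the
    remaining slices use 2, following the counts in the Figure 5 caption.
    """
    schedule = []
    for s in range(num_slices):
        stages = 4 if s < fine_slices else 2
        schedule.extend((s, t) for t in range(stages))
    return schedule
```

Under these counts the schedule has 2 × 4 + 3 × 2 = 14 sequential steps, compared with 20 if every slice used the 4-stage model. This is one way to read the stated goal of reducing redundancy while keeping coding efficient.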