3DM-WeConvene: Learned Image Compression with 3D Multi-Level Wavelet-Domain Convolution and Entropy Model

Haisheng Fu, Jie Liang, Feng Liang, Zhenman Fang, Guohe Zhang, Jingning Han

TL;DR

This work introduces 3DM-WeConvene, a learned image compression (LIC) framework that adds two wavelet-domain components to CNN-based LIC models to reduce frequency-domain redundancy: a 3D multi-level wavelet-domain convolution layer (3DM-WeConv) and a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM). The method applies a 3D DWT across the channel and spatial dimensions with multiple decomposition levels, and codes low-frequency (LF) slices before high-frequency (HF) slices in the wavelet domain. Relative to H.266/VVC, it reports BD-Rate savings of 12.24%, 15.51%, and 12.97% on the Kodak, Tecnick 100, and CLIC test sets, while maintaining favorable model size and runtime. A two-stage training strategy further improves rate allocation between low- and high-frequency subbands. Beyond image compression, the 3DM-WeConv layers show promise in video compression, image classification, segmentation, and denoising, highlighting the broad applicability of frequency-domain processing in deep learning pipelines.

Abstract

Learned image compression (LIC) has recently made significant progress, surpassing traditional methods. However, most LIC approaches operate mainly in the spatial domain and lack mechanisms for reducing frequency-domain correlations. To address this, we propose a novel framework that integrates low-complexity 3D multi-level Discrete Wavelet Transform (DWT) into convolutional layers and entropy coding, reducing both spatial and channel correlations to improve frequency selectivity and rate-distortion (R-D) performance. Our proposed 3D multi-level wavelet-domain convolution (3DM-WeConv) layer first applies 3D multi-level DWT (e.g., 5/3 and 9/7 wavelets from JPEG 2000) to transform data into the wavelet domain. Then, different-sized convolutions are applied to different frequency subbands, followed by inverse 3D DWT to return to the spatial domain. The 3DM-WeConv layer can be flexibly used within existing CNN-based LIC models. We also introduce a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM), which performs slice-based entropy coding in the 3D DWT domain. Low-frequency (LF) slices are encoded first to provide priors for high-frequency (HF) slices. A two-step training strategy is adopted: first balancing LF and HF rates, then fine-tuning with separate weights. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art CNN-based LIC methods in both R-D performance and computational complexity, with larger gains for high-resolution images. On the Kodak, Tecnick 100, and CLIC test sets, our method achieves BD-Rate reductions of 12.24%, 15.51%, and 12.97%, respectively, compared to H.266/VVC.
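The transform, per-subband filtering, and inverse transform described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single decomposition level, swaps in the orthonormal Haar wavelet for the 5/3 and 9/7 wavelets named above, and uses a fixed mean filter as a stand-in for learned convolutions. The function names and the kernel sizes `k_low` and `k_high` are hypothetical choices made for illustration only.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar analysis along one axis (even length assumed)."""
    a = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)  # even samples
    b = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)  # odd samples
    return (a + b) / np.sqrt(2.0), (a - b) / np.sqrt(2.0)

def haar_merge(lo, hi, axis):
    """Inverse of haar_split: rebuild even/odd samples and interleave them."""
    a = (lo + hi) / np.sqrt(2.0)
    b = (lo - hi) / np.sqrt(2.0)
    out = np.stack([a, b], axis=axis + 1)  # pair dimension right after `axis`
    shape = list(lo.shape)
    shape[axis] *= 2
    return out.reshape(shape)

def dwt3d(x):
    """One-level 3D Haar DWT of a (C, H, W) tensor into 8 subbands 'LLL'..'HHH'."""
    bands = {"": x}
    for axis in range(3):  # channel, height, width
        nxt = {}
        for key, v in bands.items():
            lo, hi = haar_split(v, axis)
            nxt[key + "L"], nxt[key + "H"] = lo, hi
        bands = nxt
    return bands

def idwt3d(bands):
    """Inverse one-level 3D Haar DWT; undoes dwt3d exactly."""
    for axis in reversed(range(3)):
        bands = {key: haar_merge(bands[key + "L"], bands[key + "H"], axis)
                 for key in {k[:-1] for k in bands}}
    return bands[""]

def box_filter(x, k):
    """Depthwise k x k mean filter with edge padding (stand-in for a learned conv)."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    h, w = x.shape[1:]
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def wavelet_domain_layer(x, k_low=5, k_high=3):
    """DWT -> subband-specific filtering -> inverse DWT (kernel sizes are assumptions)."""
    bands = dwt3d(x)
    processed = {key: box_filter(v, k_low if key == "LLL" else k_high)
                 for key, v in bands.items()}
    return idwt3d(processed)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))       # hypothetical (C, H, W) feature tensor
assert np.allclose(idwt3d(dwt3d(x)), x)  # the transform pair is invertible
y = wavelet_domain_layer(x)              # same shape as the input
```

Because each subband has half the resolution along every transformed axis, a kernel of a given size covers a larger region of the original tensor, which is one reason filtering in the wavelet domain can be cheap. A multi-level version would apply `dwt3d` again to the `LLL` subband before filtering.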

Paper Structure

This paper contains 23 sections, 3 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: The decoding time, model size, and BD-Rate reductions over H.266/VVC for different LIC schemes on the Kodak test set. The area of each circle is proportional to the number of parameters (also written in the figure) of each model. Note that the unit of the left subfigure is milliseconds (ms), while the unit of the right subfigure is seconds (s). Our method achieves the best trade-off among the three metrics.
  • Figure 2: The overall architecture of the proposed 3DM-WeConvene scheme. The details of the 3DM-WeConv layer and the 3DWeChARM module are in Fig. 3 and Fig. 5, respectively. Conv(3, s, N) represents a convolutional layer with a $3 \times 3$ kernel size, stride $s$, and $N$ filters, while TConv(3, s, N) denotes a transposed convolutional layer. Dashed shortcut connections indicate changes in tensor size. The abbreviations AE and AD refer to the Arithmetic Encoder and Arithmetic Decoder in entropy coding, respectively.
  • Figure 3: Forward 3DM-WeConv layer with downsampling.
  • Figure 4: The 3D illustration of the 3DM-WeConv layer in Fig. 3.
  • Figure 5: Details of the 3DWeChARM module in entropy coding. The latent representation $y$ after 3D DWT is divided into slices, which are coded sequentially. The details of each slice coding are shown in Fig. \ref{fig:wecharm_structure}.
  • ...and 8 more figures
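
The BD-Rate values quoted in the abstract and plotted in Figure 1 summarize the average bitrate change of one codec relative to an anchor at equal quality. The sketch below follows the classic Bjøntegaard procedure (a cubic polynomial fit of log-rate against PSNR, integrated over the overlapping PSNR range). Whether the paper uses this exact variant is not stated here, so treat it as an assumption. The rate and PSNR points are hypothetical and are not results from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent; negative means the test codec saves bits."""
    # Fit log10(rate) as a cubic polynomial of PSNR for each codec.
    p_anchor = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_test = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the PSNR interval covered by both curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_anchor, int_test = np.polyint(p_anchor), np.polyint(p_test)
    avg_anchor = (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    # Convert the mean log-rate difference back to a percentage.
    return (10.0 ** (avg_test - avg_anchor) - 1.0) * 100.0

# Hypothetical R-D points: the test codec needs 10% fewer bits at every PSNR.
psnr = np.array([28.0, 31.0, 34.0, 37.0])     # dB
rate_anchor = np.array([0.1, 0.25, 0.5, 1.0])  # bits per pixel
rate_test = 0.9 * rate_anchor
print(round(bd_rate(rate_anchor, psnr, rate_test, psnr), 2))  # prints -10.0
```

A result of -12.24, for example, would mean the test codec needs 12.24% fewer bits than the anchor on average for the same PSNR.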