Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression

Hanyue Tu, Siqi Wu, Li Li, Wengang Zhou, Houqiang Li

TL;DR

This paper tackles the limitations of autoencoder-based learned image compression by introducing a lightweight multi-scale invertible neural network that bijectively maps images to latent representations, so the transform itself discards no information before quantization. It combines a four-level invertible transform with a multi-scale spatial-channel context model and extended gain units to support wide-range, variable-rate compression from a single model. Experimental results show state-of-the-art performance among variable-rate methods, outperforming VVC across a very wide range of bit rates (especially high ones) and remaining competitive with multi-model learned approaches, while achieving superior fidelity under repeated re-encodings. The approach offers practical benefits in terms of model size, training efficiency, and robustness, making invertible transforms a compelling alternative for high-bitrate image compression.

Abstract

Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.
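
To make the idea of a bijective mapping concrete, here is a minimal sketch of an additive coupling layer, a common building block of invertible networks. It is an illustration only: the abstract does not specify the paper's invertible unit, so the function names `coupling_forward` and `coupling_inverse`, the channel split, and the toy `shift_fn` are all assumptions rather than the authors' design.

```python
import numpy as np

def coupling_forward(x, shift_fn):
    """Additive coupling (assumed stand-in, not the paper's unit).

    Splits channels in half, keeps the first half, and shifts the
    second half by a function of the first half.
    """
    x1, x2 = np.split(x, 2, axis=0)
    y2 = x2 + shift_fn(x1)  # shift_fn can be arbitrary; it is never inverted
    return np.concatenate([x1, y2], axis=0)

def coupling_inverse(y, shift_fn):
    """Exact inverse: recompute the same shift from the untouched half."""
    y1, y2 = np.split(y, 2, axis=0)
    x2 = y2 - shift_fn(y1)
    return np.concatenate([y1, x2], axis=0)

# Toy usage with a hypothetical nonlinear shift function.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))          # (channels, height, width)
shift = lambda t: np.tanh(3.0 * t)      # placeholder for a small network
y = coupling_forward(x, shift)
x_rec = coupling_inverse(y, shift)
print(np.allclose(x_rec, x))
```

Because the inverse only subtracts what the forward pass added, the mapping loses nothing regardless of how complex `shift_fn` is. In a codec built this way, distortion enters only once the latents are quantized, which is the property the abstract contrasts with autoencoders.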

Paper Structure

This paper contains 24 sections, 9 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Comparison of compression performance and model size. BD-rate over VTM-12.1 on Kodak is reported. A lower BD-rate indicates better performance. Our method, Cui2021 [cui2021asymmetric] and Cai2022 [cai2022high] are variable-rate models, while Xie2021 [xie2021enhanced], Zou2022 [zou2022devil] and Liu2023 [liu2023learned] are fixed-rate models that require multiple models for different rates ("$\times n$" means that $n$ models are required). Our method surpasses VTM-12.1 across a wide range of rates with only a single lightweight variable-rate model.
  • Figure 2: Framework of our proposed invertible transform-based image compression model. Four-level invertible blocks map the input image to multi-scale latent representations $\{\bm{y}_1, \bm{y}_2,\bm{y}_3,\bm{y}_4,\bm{y}_5\}$, which are quantized and passed through the reverse process of the network to reconstruct the image. The post-processing module is used to compensate for the quantization loss. "QECG" includes the process of quantization and entropy coding with gain units. "G" and "IG" denote gain units and inverse gain units, respectively. "MS-SCCTX" is the proposed multi-scale spatial-channel context model. "Q" represents quantization. "AE" and "AD" stand for arithmetic encoding and decoding, respectively.
  • Figure 3: Invertible down-scaling using the space-to-depth module.
  • Figure 4: Architecture of the invertible unit. We stack $N$ invertible units within each invertible block.
  • Figure 5: Proposed spatial-channel context model with extended gain units for entropy coding. The architecture of the latent residual prediction (LRP) module is the same as the channel-wise context model.
  • ...and 8 more figures
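
The invertible down-scaling named in the Figure 3 caption can be sketched with plain array reshapes. This is a generic space-to-depth rearrangement, not the authors' code: the function names, the (C, H, W) layout, the channel ordering, and the factor `r = 2` are assumptions made for illustration.

```python
import numpy as np

def space_to_depth(x, r=2):
    """Move each r x r spatial patch into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)        # (c, r, r, h/r, w/r)
    return x.reshape(c * r * r, h // r, w // r)

def depth_to_space(y, r=2):
    """Exact inverse of space_to_depth: (C*r*r, H, W) -> (C, H*r, W*r)."""
    cr2, h, w = y.shape
    c = cr2 // (r * r)
    y = y.reshape(c, r, r, h, w)
    y = y.transpose(0, 3, 1, 4, 2)        # (c, h, r, w, r)
    return y.reshape(c, h * r, w * r)

# Toy usage: a 3-channel 4x6 "image" with distinct values.
x = np.arange(3 * 4 * 6).reshape(3, 4, 6)
y = space_to_depth(x)
print(y.shape)                                # (12, 2, 3)
print(np.array_equal(depth_to_space(y), x))   # True
```

Unlike strided convolution or pooling, this rearrangement keeps every input value, so spatial resolution is halved while the total number of elements stays constant. That is what allows a down-scaling step to sit inside a bijective transform.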