An Efficient Implicit Neural Representation Image Codec Based on Mixed Autoregressive Model for Low-Complexity Decoding

Xiang Liu; Jiahong Chen; Bin Chen; Zimo Liu; Baoyi An; Shu-Tao Xia; Zhi Wang

TL;DR

This work tackles efficient INR-based image compression for edge devices by introducing a Mixed AutoRegressive Model (MARM) that blends channel-wise ARU for low-resolution latents with classic autoregression for high-resolution detail. A two-pass checkerboard decoding strategy and a mixed synthesis module further improve parallelism and reconstruction quality, enabling substantial reductions in decoding time. Empirical results show GPU decoding speedups of roughly 10–20x with rate–distortion performance comparable to prior INR codecs and quality competitive even with Hyperprior, all while maintaining low decoding complexity. The approach makes INR-based codecs more practical for edge and AR/VR devices by balancing speed, quality, and resource usage.

Abstract

Displaying high-quality images on edge devices, such as augmented reality devices, is essential for enhancing the user experience. However, these devices often face power consumption and computing resource limitations, making it challenging to apply many deep learning-based image compression algorithms in this field. Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models: low computational complexity and parameter-free decoding. It also outperforms many traditional and early neural compression methods in terms of quality. In this study, we introduce a new Mixed AutoRegressive Model (MARM) to significantly reduce the decoding time of the current INR codec, along with a new synthesis network to enhance reconstruction quality. MARM includes our proposed AutoRegressive Upsampler (ARU) blocks, which are highly computationally efficient, and ARM from previous work to balance decoding time and reconstruction quality. We also propose enhancing ARU's performance using a checkerboard two-stage decoding strategy. Moreover, the ratio of different modules can be adjusted to maintain a balance between quality and speed. Comprehensive experiments demonstrate that our method significantly improves computational efficiency while preserving image quality. With different parameter settings, our method can achieve over an order of magnitude acceleration in decoding time without industrial-level optimization, or achieve state-of-the-art reconstruction quality compared with other INR codecs. To the best of our knowledge, our method is the first INR-based codec comparable with Hyperprior in both decoding speed and quality while maintaining low complexity.

Paper Structure (17 sections, 21 equations, 10 figures, 3 tables)

Introduction
Related Work
Neural Image Compression
Implicit Neural Representation
INR-Based Image Compression
Method
System Overview
Mixed Autoregressive Model
Two-Pass ARU
Mixed Synthesis Module
Complexity Analysis
Experiment
Datasets, Metrics and Experiments Setup
Image Compression
Implicit Neural Representation
...and 2 more sections

Figures (10)

Figure 1: Overall performance of different methods on the CLIC dataset. The area represents the decoding complexity in MACs/pixel (smaller is better). BD-Time, like BD-Rate, is a metric that measures decoding speed across different qualities; we discuss it further in Section \ref{sec:exp:subsec1}. Note that on parallel accelerators such as GPUs, low complexity does not necessarily mean fast decoding, and vice versa (e.g., Hyperprior \cite{Ball2018Variational} and COOL-CHIC \cite{ladune2022cool}). Our method achieves a better trade-off than existing methods.
Figure 2: Asymmetric compression and decompression framework of the INR codec.
Figure 3: System architecture. $\psi$, $\phi$, and $\theta$ are the network parameters of the MARM module, the upsample module, and the synthesis module, respectively. These parameters are decoded first to initialize the model. The MARM module then decodes the latents $\hat{\boldsymbol{y}}$, which are integer-valued matrices with pyramid shapes. The upsample module transforms these latents into a dense representation $\hat{\boldsymbol{z}}$ of shape $L\times H\times W$. The synthesis module transforms the dense tensor into an image with $(r, g, b)$ channels and shape $H\times W$. Note that since the code length of $\psi, \phi, \theta$ is very small, we use the original notation to represent both the quantized and unquantized versions for simplicity. The top part of MARM is the channel-wise autoregressive model, which predicts entropy parameters conditioned on the previous latent. The bottom part corresponds to the inner-channel autoregressive model, which generates parameters pixel by pixel. Given the number of ARM blocks $M$ and the total number of latents $L$, we denote $i=L-M-1$, $j=L-M$, and $k=L-1$. $\mu_0$ and $\sigma_0$ are initialized to tensors filled with $0$ and $\exp(-0.5)$, respectively, for all images. $\mu$ and $\sigma$ with a subscript denote values output as a matrix, whereas those without a subscript denote serially generated scalars.
Figure 4: Network architecture of ARU (Fig. \ref{fig:aru_details_sub}). ARU takes the previous parameter matrices $\mu_{i-1}$, $\sigma_{i-1}$ and the latents as input. ARU Pass 1 outputs the decoding parameters $\mu_i^{\diamond}, \sigma_i^{\diamond}$ for the anchors in the latents. ARU Pass 2 generates the parameters for the non-anchors, conditioned on $\mu_i^{\diamond}, \sigma_i^{\diamond}$ and the anchors. Checkerboard denotes merging the anchors of the first tensor with the non-anchors of the second tensor. Fig. \ref{fig:latent_vis} provides an example of a latent during the decoding process.
Figure 5: Detailed structure of the ARU and synthesis networks. The input and output channels of every Conv module are determined by its input tensor. For example, with $L=7$ latents, $\hat{\boldsymbol{z}}$ in Fig. \ref{fig:sythesis} is a $1\times L\times H \times W$ tensor, so the SeparableConv has 7 input channels and 7 output channels.
...and 5 more figures
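
The index notation in the Figure 3 caption and the checkerboard merge in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the choice of even-parity positions as anchors is an assumption (the captions do not say which parity is used), and giving $\mu_0$, $\sigma_0$ the same shape as a latent is likewise assumed.

```python
import math


def block_indices(num_latents, num_arm_blocks):
    """Return (i, j, k) as defined in the Figure 3 caption:
    i = L - M - 1, j = L - M, k = L - 1."""
    L, M = num_latents, num_arm_blocks
    if not 0 <= M <= L:
        raise ValueError("need 0 <= M <= L")
    return L - M - 1, L - M, L - 1


def initial_params(height, width):
    """mu_0 is filled with 0 and sigma_0 with exp(-0.5), per the caption.
    The (height, width) shape is an assumption of this sketch."""
    mu0 = [[0.0] * width for _ in range(height)]
    sigma0 = [[math.exp(-0.5)] * width for _ in range(height)]
    return mu0, sigma0


def is_anchor(row, col):
    # Assumption: anchors sit where (row + col) is even.
    return (row + col) % 2 == 0


def checkerboard_merge(first, second):
    """Take anchor positions from `first` and non-anchor positions
    from `second`, as the Figure 4 caption describes."""
    if len(first) != len(second) or any(
        len(a) != len(b) for a, b in zip(first, second)
    ):
        raise ValueError("inputs must have the same shape")
    return [
        [a if is_anchor(r, c) else b for c, (a, b) in enumerate(zip(row_a, row_b))]
        for r, (row_a, row_b) in enumerate(zip(first, second))
    ]


if __name__ == "__main__":
    # L = 7 comes from the Figure 5 caption; M = 2 is a made-up value.
    print(block_indices(7, 2))
    pass1 = [[1, 1, 1], [1, 1, 1]]  # stand-in for Pass 1 outputs
    pass2 = [[2, 2, 2], [2, 2, 2]]  # stand-in for Pass 2 outputs
    print(checkerboard_merge(pass1, pass2))
```

With $L=7$ and the hypothetical $M=2$, the indices evaluate to $i=4$, $j=5$, $k=6$. In the merged result, every anchor position carries a value from the first input and every non-anchor position a value from the second, which is how a single parameter matrix could be assembled from the two passes.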
