Diversify, Contextualize, and Adapt: Efficient Entropy Modeling for Neural Image Codec

Jun-Hyuk Kim, Seungeon Kim, Won-Hee Lee, Dokwan Oh

TL;DR

This paper proposes a simple yet effective entropy modeling framework that leverages sufficient contexts for forward adaptation without compromising on bit-rate, and introduces a strategy of diversifying hyper latent representations for forward adaptation, i.e., using two additional types of contexts along with the existing single type of context.

Abstract

Designing a fast and effective entropy model is challenging but essential for practical application of neural codecs. Beyond spatial autoregressive entropy models, more efficient backward adaptation-based entropy models have been recently developed. They not only reduce decoding time by using a smaller number of modeling steps but also maintain or even improve rate-distortion performance by leveraging more diverse contexts for backward adaptation. Despite their significant progress, we argue that their performance has been limited by the simple adoption of the design convention for forward adaptation: using only a single type of hyper latent representation, which does not provide sufficient contextual information, especially in the first modeling step. In this paper, we propose a simple yet effective entropy modeling framework that leverages sufficient contexts for forward adaptation without compromising on bit-rate. Specifically, we introduce a strategy of diversifying hyper latent representations for forward adaptation, i.e., using two additional types of contexts along with the existing single type of context. In addition, we present a method to effectively use the diverse contexts for contextualizing the current elements to be encoded/decoded. By addressing the limitation of the previous approach, our proposed framework leads to significant performance improvements. Experimental results on popular datasets show that our proposed framework consistently improves rate-distortion performance across various bit-rate regions, e.g., 3.73% BD-rate gain over the state-of-the-art baseline on the Kodak dataset.

Paper Structure

This paper contains 29 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: DCA diversifies the hyper latent representations and contextualizes the current elements by leveraging the diverse hyper latent representations along with the previous elements. As a result, the probability distributions adapt effectively, leading to accurate entropy modeling.
  • Figure 2: Overview of the neural image codec with the proposed entropy model, referred to as DCA. DCA can be combined with any analysis and synthesis transforms $f_a(\cdot)$ and $f_s(\cdot)$. DCA is an adaptive entropy model consisting of two main stages: diversify and contextualize. First, given the latent representation $\bm y$, DCA extracts diverse hyper latent representations $\hat{\bm z}_l$, $\hat{\bm z}_r$, and $\hat{\bm z}_g$, and then encodes them into bitstreams using learned factorized entropy models, which are omitted in this figure for simplicity. Second, contextualization proceeds over four steps. Using the three features $\bm \phi_l$, $\bm \phi_r$, and $\bm \phi_g$ (derived from the three hyper latent representations, respectively) and all the elements encoded/decoded before the $i$-th step, i.e., $\hat{\bm y}^{<i}$, DCA contextualizes the current elements to be encoded/decoded, i.e., $\hat{\bm y}^{i}$, and finally obtains adaptive distribution parameters $\bm{\mu}^i$ and $\bm{\sigma}^i$ for probability modeling. Using the learned adaptive probability model, the quantized latent representation $\hat{\bm y}$ is encoded into a bitstream.
  • Figure 3: Example of the quadtree partition-based backward adaptation for $\hat{\bm y}\in \mathbb{R}^{4\times 4\times 320}$. For simplicity, channel dimensions are represented via different colors. $\hat{\bm y}^i$ denotes the elements to be encoded/decoded at the $i$-th step. For modeling the current elements $\hat{\bm y}^i$, all the previously modeled elements $\hat{\bm y}^{< i}$ are used. For example, the elements corresponding to the red arrow leverage diverse contexts, including elements across different channels at the same spatial location (local context, denoted as L) and the four spatially adjacent elements of the same channel (regional context, denoted as R).
  • Figure 4: Performance comparison with the latest entropy models on the two benchmark datasets: (a) Kodak and (b) Tecnick. For clear comparisons, we denote each method as follows. B and F mean backward and forward adaptation, respectively, and the corresponding methods are written in parentheses. For backward adaptation, AR, ChARM, and Quadtree represent the spatial autoregressive model, the channel-wise autoregressive model, and the quadtree partition-based model, respectively. For forward adaptation, L, R, and G mean local, regional, and global hyper latent representations, respectively.
  • Figure 5: Performance comparison with the latest entropy models on the Kodak dataset in terms of decoding time, BD-rate, and model size. Decoding time is measured on an NVIDIA V100 GPU. BD-rate means the average rate savings over VTM-12.1. The size of each circle is proportional to the number of model parameters, and the specific numbers are written to the left of the circles.
  • ...and 8 more figures
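
The Figure 2 caption states that DCA outputs adaptive distribution parameters $\bm{\mu}^i$ and $\bm{\sigma}^i$ for probability modeling. The following is a minimal sketch of how such parameters are commonly turned into a bit-cost estimate. It assumes a Gaussian model integrated over each unit-width quantization bin, which the captions above do not confirm; the function names `gaussian_cdf` and `estimated_bits` and the `min_prob` floor are hypothetical and used only for illustration.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def estimated_bits(y_hat, mu, sigma, min_prob=1e-9):
    """Estimated cost in bits of one integer-quantized element y_hat.

    Assumption (not taken from the paper): the element follows N(mu, sigma^2)
    and its probability mass is the integral over [y_hat - 0.5, y_hat + 0.5].
    """
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    # Floor the probability so the logarithm stays finite for outliers.
    return -math.log2(max(p, min_prob))


# A well-adapted prediction (mean on target, small scale) costs few bits;
# a poorly adapted one (mean off target, large scale) costs more.
sharp = estimated_bits(2, mu=2.0, sigma=0.3)
loose = estimated_bits(2, mu=0.0, sigma=2.0)
print(f"sharp: {sharp:.3f} bits, loose: {loose:.3f} bits")
```

Under this assumed model, the closer the predicted mean is to the actual value and the smaller the predicted scale, the fewer bits the element costs. This is the sense in which richer context can reduce bit-rate.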