Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression

A. Burakhan Koyuncu, Panqi Jia, Atanas Boev, Elena Alshina, Eckehard Steinbach

TL;DR

This paper introduces the Efficient Contextformer (eContextformer), a computationally efficient transformer-based autoregressive context model for learned image compression. It achieves up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpasses various learning-based compression models.

Abstract

Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, but at the expense of significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer), a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on the Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.
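To make the benefit of parallel context modeling concrete, the sketch below counts sequential decoding steps under two simplifying assumptions that are not taken from the abstract: a serial model visits every spatial position of every channel segment one at a time, and a parallel model needs one step per combination of channel segment and checkered spatial group. The function names, the 32×48 latent size, and the choice of 4 channel segments are hypothetical illustration values. The step ratio is not the reported ~210x speedup, which depends on the actual implementation and hardware.

```python
def serial_steps(height: int, width: int, num_segments: int) -> int:
    """Sequential steps if every (position, channel segment) pair is coded alone."""
    return height * width * num_segments


def parallel_steps(num_segments: int, spatial_groups: int = 2) -> int:
    """Sequential steps if all elements of a group are coded at once.

    spatial_groups=2 models a checkered split of the spatial positions.
    """
    return num_segments * spatial_groups


if __name__ == "__main__":
    # Hypothetical latent: 32 x 48 spatial positions, 4 channel segments.
    h, w, n_cs = 32, 48, 4
    print("serial steps:  ", serial_steps(h, w, n_cs))
    print("parallel steps:", parallel_steps(n_cs))
```

Under these assumptions the number of parallel steps no longer depends on the image size, which is the property that grouping-based context models exploit.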

Paper Structure (25 sections, 5 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Illustration of the context modeling process, where the symbol probability of the current latent variable is estimated by aggregating the information of the latent variables in its context. Previously decoded latent elements that do not take part in context modeling and elements yet to be coded are depicted with separate markers. The illustrated context models are (a) the model with 2D masked convolutions [minnen2018joint, cheng2020learned], (b) the model with 3D masked convolutions [liu2019non, mentzer2018conditional], (c) the channel-wise autoregressive model [minnen2020channel], and (d-e) the Contextformer with sfo and cfo coding modes [koyuncu2022contextformer], respectively.
  • Figure 2: Illustration of different parallelization techniques for context modeling: (a) patch-wise grouping [koyuncu2021parallel], (b) checkered grouping [he2021checkerboard], (c) channel-wise grouping [minnen2020channel], and (d-e) the combination of checkered and channel-wise grouping with sfo and cfo coding, respectively. All latent elements within the same group (depicted with the same color) are coded simultaneously, while the context model aggregates the information from the previously coded groups. For instance, [koyuncu2021parallel, he2021checkerboard] use 2D masked convolutions in the context model, and [minnen2020channel] applies multiple CNNs to channel-wise concatenated groups. The context model of [he2022elic] combines the techniques of [minnen2020channel, he2021checkerboard] and can be illustrated as in (d). Our proposed model (eContextformer), as well as the experimental model (pContextformer), uses the parallelization techniques depicted in (d-e). However, our models employ spatio-channel attention in context modeling and do not require additional networks for channel-wise concatenation.
  • Figure 3: Experimental study on the Kodak dataset [franzen1999kodak], comparing the rate-distortion performance of different model configurations.
  • Figure 4: Illustration of our compression framework utilizing the eContextformer with window and shifted-window spatio-channel attention. The segment generator splits the latent into $N_{cs}$ channel segments for further processing. Following our previous work [koyuncu2022contextformer], the output of the hyperdecoder is not segmented but repeated along the channel dimension to include more channel-wise local neighbors for the entropy modeling.
  • Figure 5: Illustration of the optimized processing steps of the eContextformer. From left to right, the latent tensor (a) is first split into channel segments (b) and reordered according to the group coding order (c). Finally, the transformer layers with window and shifted-window spatio-channel attention (d-e) are applied sequentially to the reordered tensor.
  • ...and 5 more figures
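
The grouping described in the captions of Figures 2 and 5 (split the latent into channel segments, assign each spatial position to a checkered group, then order the resulting groups for coding) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all function names and the `segment_first` flag are hypothetical, the checkered split is assumed to be a plain row-plus-column parity, and which of the two orderings corresponds to the sfo or cfo mode is deliberately not asserted here.

```python
def split_channels(num_channels, num_segments):
    """Return (start, stop) channel ranges, one per channel segment."""
    if num_channels % num_segments != 0:
        raise ValueError("num_channels must be divisible by num_segments")
    size = num_channels // num_segments
    return [(s * size, (s + 1) * size) for s in range(num_segments)]


def checkered_group(row, col):
    """Checkerboard parity of a spatial position (0 or 1)."""
    return (row + col) % 2


def coding_order(num_segments, segment_first=True):
    """List the (segment, parity) groups in the order they are coded.

    segment_first=True : code both parities of a segment, then the next segment.
    segment_first=False: code one parity for all segments, then the other parity.
    """
    if segment_first:
        return [(s, p) for s in range(num_segments) for p in range(2)]
    return [(s, p) for p in range(2) for s in range(num_segments)]


def group_step_map(height, width, num_segments, segment_first=True):
    """Map every (segment, row, col) to the step at which its group is coded."""
    order = coding_order(num_segments, segment_first)
    step_of = {group: step for step, group in enumerate(order)}
    return {
        (s, r, c): step_of[(s, checkered_group(r, c))]
        for s in range(num_segments)
        for r in range(height)
        for c in range(width)
    }


if __name__ == "__main__":
    print(split_channels(8, 4))
    for key, step in sorted(group_step_map(2, 2, 2).items()):
        print(key, "-> step", step)
```

All elements that share a step index form one group and would be coded simultaneously, with the context model seeing only elements whose step index is smaller.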