GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression

Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao

TL;DR

GroupedMixer introduces a group-wise autoregressive transformer for learned image compression, partitioning latents into G groups and using inner-group and cross-group token-mixers to capture spatial-channel context with reduced computation. A context cache optimizes inference by reusing cross-group attention activations, enabling faster coding speeds. The approach delivers state-of-the-art rate-distortion performance on Kodak, CLIC'21, and Tecnick datasets, with substantial BD-rate savings over VVC and prior transformer/CNN-based methods, while maintaining practical latency. The combination of shared transformer weights, decomposed attention, and caching yields improved efficiency and scalability for high-resolution images in learned compression.

Abstract

Transformer-based entropy models have gained prominence in recent years because they capture long-range dependencies in probability distribution estimation better than convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which achieves both faster coding and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along the spatial and channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group, while the cross-group token-mixer interacts with previously decoded groups. Alternating the two token-mixers enables global contextual reference. To further speed up network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in the cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer achieves state-of-the-art rate-distortion performance with fast compression speed.
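
The caching idea in the abstract can be illustrated with a generic key/value cache for causal attention over the group axis. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single attention head in NumPy, one token per group, and hypothetical names (`causal_attention_full`, `CachedCausalAttention`, `Wq`, `Wk`, `Wv`). It only shows why decoding group by group with cached keys and values reproduces the masked computation without recomputing activations for earlier groups.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_full(x, Wq, Wk, Wv):
    # x: (G, d), one token per group (illustrative assumption).
    # Group g may only attend to groups 0..g.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class CachedCausalAttention:
    # Hypothetical helper: stores keys/values of already decoded groups.
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x_g):
        # x_g: (d,), the token of the group being decoded now.
        q = x_g @ self.Wq
        self.keys.append(x_g @ self.Wk)    # computed once, then reused
        self.values.append(x_g @ self.Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))
        return weights @ V

rng = np.random.default_rng(0)
G, d = 5, 8
x = rng.normal(size=(G, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention_full(x, Wq, Wk, Wv)
cache = CachedCausalAttention(Wq, Wk, Wv)
stepwise = np.stack([cache.step(x[g]) for g in range(G)])
print(np.allclose(full, stepwise))
```

In this sketch each decoding step projects only the newest group, so the per-step cost grows with the number of cached groups rather than requiring a full recomputation over all previous groups.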

Paper Structure (31 sections, 18 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: (a) GroupedMixer overview. Preprocessed latent representations $\bm{\hat{y}}$ are passed through GroupedMixer modules to aggregate group-wise context, and are finally projected to distribution parameters. (b) Illustration of the token-mixers. The cross-group token-mixer mixes information across previously decoded groups, while the inner-group token-mixer mixes information within a group. (c) Detailed network architectures of the two token-mixers, where MSA denotes multi-head self-attention and PEG denotes position embedding generator.
  • Figure 2: Grouping scheme for modeling spatial-channel context. The latent representations are split along the channel and spatial dimensions into $G=k_c\cdot k_h \cdot k_w$ groups, and the numbers indicate the order of autoregression. This figure uses $(k_c,k_h,k_w)=(2,2,2)$ as an example.
  • Figure 3: Detailed structure of Parameters Net. $c$ is the number of channels in each group.
  • Figure 4: Context cache optimization at inference time. Each block denotes a group of attention activation values.
  • Figure 5: Performance evaluation on the Kodak dataset.
  • ...and 5 more figures
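
The grouping in the Figure 2 caption, $G=k_c\cdot k_h \cdot k_w$, can be sketched as follows. This is an illustrative assumption rather than the paper's exact scheme: the function name `partition_latents` is hypothetical, channels are split into $k_c$ contiguous slices, spatial positions are split by strided (interleaved) sampling, and the channel-major ordering of groups is a guess at the autoregression order.

```python
import numpy as np

def partition_latents(y, k_c=2, k_h=2, k_w=2):
    """Split a (C, H, W) latent tensor into G = k_c * k_h * k_w groups.

    Assumptions (not confirmed by the source): contiguous channel slices,
    strided spatial sampling, and channel-major group ordering.
    """
    C, H, W = y.shape
    if C % k_c or H % k_h or W % k_w:
        raise ValueError("C, H, W must be divisible by k_c, k_h, k_w")
    c = C // k_c  # channels per group
    groups = []
    for ci in range(k_c):
        for hi in range(k_h):
            for wi in range(k_w):
                # Channel slice ci, every k_h-th row from hi,
                # every k_w-th column from wi.
                groups.append(y[ci * c:(ci + 1) * c, hi::k_h, wi::k_w])
    return groups

# Example with (k_c, k_h, k_w) = (2, 2, 2), as in the caption.
y = np.arange(4 * 4 * 4).reshape(4, 4, 4)
groups = partition_latents(y, 2, 2, 2)
print(len(groups), groups[0].shape)
```

Every latent element lands in exactly one group, so coding the groups one after another, each conditioned on the groups before it, covers the whole tensor.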