Variable Rate Image Compression via N-Gram Context based Swin-transformer

Priyanka Mudgal

TL;DR

This work tackles variable-rate learned image compression with a single model by introducing N-gram context into the Swin Transformer. The proposed NSTB combines uni-Gram embedding, sliding-WSA, and a TAG-MLP to expand the effective receptive field and preserve detail, while ROI-aware rate optimization biases encoding toward semantically important regions. In evaluations on Kodak and COCO, the method achieves BD-rate reductions and PSNR gains, with the most pronounced improvements in ROI regions, at competitive complexity. The results suggest the approach offers practical, region-adaptive compression suitable for real-world applications, with potential extensions to video.

Abstract

This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. The Swin Transformer's restricted receptive field causes it to neglect larger regions during high-resolution image reconstruction; incorporating N-gram context overcomes this limitation. The enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, yielding a BD-rate of -5.86% (a bitrate saving at equal quality) relative to existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
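To make the BD-rate figure concrete, the following is a minimal sketch of how a Bjøntegaard-style rate difference between two rate-distortion curves can be computed. It is not the evaluation code used in the paper. The original Bjøntegaard method fits a cubic polynomial to log-rate as a function of PSNR; this sketch swaps in piecewise-linear interpolation to stay dependency-free, so its values can differ slightly from the standard metric. The function names and the sample points in the usage note are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the interpolation range")


def _mean_log_rate(psnr, log_rate, lo, hi, steps=1000):
    """Average log-rate over the PSNR interval [lo, hi] (trapezoidal rule)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        q = min(lo + k * h, hi)  # clamp against floating-point overshoot
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * _interp(q, psnr, log_rate)
    return total * h / (hi - lo)


def bd_rate(anchor, test):
    """Approximate BD-rate (in percent) of `test` relative to `anchor`.

    Each argument is a list of (bpp, psnr) points with distinct PSNR values.
    A negative result means `test` needs fewer bits at the same quality.
    Assumption: piecewise-linear interpolation replaces the cubic fit of
    the original Bjontegaard method.
    """
    if len(anchor) < 2 or len(test) < 2:
        raise ValueError("each curve needs at least two points")
    a = sorted(anchor, key=lambda p: p[1])
    t = sorted(test, key=lambda p: p[1])
    a_psnr, a_log = [p[1] for p in a], [math.log(p[0]) for p in a]
    t_psnr, t_log = [p[1] for p in t], [math.log(p[0]) for p in t]

    # Compare the curves only over the PSNR range they share.
    lo = max(a_psnr[0], t_psnr[0])
    hi = min(a_psnr[-1], t_psnr[-1])
    if hi <= lo:
        raise ValueError("the curves do not overlap in PSNR")

    diff = (_mean_log_rate(t_psnr, t_log, lo, hi)
            - _mean_log_rate(a_psnr, a_log, lo, hi))
    return (math.exp(diff) - 1.0) * 100.0
```

As a sanity check with made-up numbers, a test curve that spends 10% fewer bits than the anchor at every PSNR, for example `bd_rate([(0.2, 30), (0.4, 33), (0.8, 36)], [(0.18, 30), (0.36, 33), (0.72, 36)])`, evaluates to approximately -10.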

Paper Structure

This paper contains 11 sections and 7 figures.

Figures (7)

  • Figure 1: The visualization of the kodim24 reconstruction from the Kodak dataset shows that our method achieves better PSNR while maintaining or reducing the bit-rate compared to baseline 10222853 and traditional methods. The subtitles indicate PSNR↑/bpp↓.
  • Figure 2: BD-rate comparison of our proposed method using N-gram context with the baseline method 10222853.
  • Figure 3: The architecture of our proposed network is based on 10222853. The analysis transform $g_a$ and the synthesis transform $g_s$ map from image space ($x$) to latent space ($y$) and from latent space ($\hat{y}$) to image space ($\hat{x}$), respectively. EC and ED denote the arithmetic encoder and arithmetic decoder, respectively. The hyperprior analysis and synthesis transforms are from Minnen et al. ballé2018variational. Blocks with dotted outlines show the NSTB adopted from choi2023ngramswintransformersefficient, which contains the uni-Gram embedding and the sliding-WSA process. The dimensionality reduction via uni-Gram embedding makes sliding-WSA more efficient. Bi-directional contexts share the same sliding-WSA weights. For window-wise summation, a value from $z_{ng}$ is added equally to the $M^2$ pixels of the local window at the corresponding position.
  • Figure 4: RD-performance: (a) Variable-rate coding without ROI on Kodak. (b) Variable-rate coding with ROI on COCO dataset showing the comparison of baseline method 10222853 with our approach. (c) Variable-rate coding with NROI on COCO dataset. (d) Variable-rate coding with ROI approach on full image of COCO dataset.
  • Figure 5: Visualization of our method across different QIndex values, together with the bit-allocation map for the channel with maximal entropy. The results demonstrate that our approach allocates more bits to high-contrast regions, enhancing their quality, while assigning fewer bits to low-contrast areas such as the sky and clouds. The corresponding QIndex and PSNR↑/bpp↓ values are given below each image.
  • ...and 2 more figures
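
The window-wise summation described in the Figure 3 caption can be illustrated with a small sketch. It assumes a single-channel feature map stored as nested lists and an N-gram context map $z_{ng}$ holding one value per $M \times M$ window. The function name `window_wise_sum` and the data layout are hypothetical and are not taken from the paper's implementation.

```python
def window_wise_sum(features, z_ng, M):
    """Add each context value to all M*M pixels of its local window.

    features: H x W nested list (single channel; an assumption).
    z_ng:     (H // M) x (W // M) nested list, one value per window.
    M:        window size; H and W must be multiples of M.
    """
    H, W = len(features), len(features[0])
    if H % M or W % M:
        raise ValueError("feature map size must be a multiple of M")
    if len(z_ng) != H // M or any(len(row) != W // M for row in z_ng):
        raise ValueError("z_ng must hold exactly one value per window")

    # Pixel (i, j) belongs to the window at (i // M, j // M), so every
    # pixel in a window receives the same context value.
    return [
        [features[i][j] + z_ng[i // M][j // M] for j in range(W)]
        for i in range(H)
    ]
```

For a 4 x 4 map of zeros with `M = 2` and `z_ng = [[1, 2], [3, 4]]`, each 2 x 2 window is filled with its own context value, so the top-left window becomes all 1s and the bottom-right window all 4s.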