Variable Rate Image Compression via N-Gram Context based Swin-transformer
Priyanka Mudgal
TL;DR
This work tackles variable-rate learned image compression with a single model by introducing N-gram context into the Swin Transformer. The proposed NSTB combines uni-gram embeddings, sliding-WSA, and a TAG-MLP to expand the effective receptive field and preserve detail, while ROI-aware rate optimization biases encoding toward semantically important regions. In evaluations on Kodak and COCO-based data, the method achieves BD-rate reductions and PSNR gains, with the largest improvements in ROI regions, at competitive complexity. The results suggest that the approach offers practical, region-adaptive compression suitable for real-world applications, with potential extensions to video.
Abstract
This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. The standard Swin Transformer has a restricted receptive field, so it neglects larger regions when reconstructing high-resolution images; incorporating N-gram context overcomes this limitation. The enhancement expands the region considered when restoring each pixel, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, yielding a BD-rate of -5.86% (a 5.86% bitrate reduction at equal quality) relative to existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
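For readers unfamiliar with the BD-rate figure quoted above, the following is a minimal sketch of the standard Bjøntegaard delta-rate calculation (cubic fit of log-rate against PSNR, averaged over the overlapping quality range). It is not taken from the paper: the function name `bd_rate` and the rate/PSNR values are hypothetical, chosen only to illustrate how a negative BD-rate corresponds to a bitrate saving.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent; negative means the test codec saves bits."""
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))

    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    fit_a = np.polyfit(pa, log_ra, 3)
    fit_t = np.polyfit(pt, log_rt, 3)

    # Integrate both fits over the PSNR range the two curves share.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Average log-rate difference converted to a percentage rate change.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

# Hypothetical rate (bpp) / PSNR (dB) points, not results from the paper.
anchor_rate = [0.2, 0.4, 0.8, 1.6]
psnr = [30.0, 33.0, 36.0, 39.0]
test_rate = [0.9 * r for r in anchor_rate]  # 10% fewer bits at every quality level
print(round(bd_rate(anchor_rate, psnr, test_rate, psnr), 2))
```

Because the hypothetical test codec uses 10% fewer bits at every PSNR level, the sketch should report a BD-rate close to -10%. The reported -5.86% is read the same way: on average, about 5.86% fewer bits for the same reconstruction quality.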
