Efficient Masked Image Compression with Position-Indexed Self-Attention

Chengjie Dai, Tiantian Song, Hui Tang, Fangdong Chen, Bowei Yang, Guanghua Song

TL;DR

This work targets the redundant computation in semantic-mask-based image compression. It introduces EMIC, a framework that applies the semantic mask before encoding and then encodes and decodes only the visible patches, using a position-indexed self-attention mechanism. Combining this mask-before-encoding strategy with a hierarchical transformer and a decomposed, Manhattan-distance-based attention (DPISA), EMIC reduces FLOPs and encoding time while remaining competitive on downstream tasks and on image quality in the regions the mask keeps visible. It uses a transformer-based entropy model and a rate-distortion objective computed over the visible patches only. Compared with baselines such as GIT-SSIC, it is considerably cheaper to compute at comparable task accuracy. The results suggest practical value for machine vision pipelines that also need human verification, since a single pass can serve both machine and human vision tasks.

Abstract

In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.
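The difference between zeroing masked regions and dropping them can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `select_visible_patches` and `attention_pairs`, the 4x4 patch grid, and the string placeholders standing in for patch features are all assumptions made for illustration. The sketch keeps only the visible patches together with their (row, column) position indices, then compares how many query-key pairs a full self-attention layer would evaluate in each case.

```python
def select_visible_patches(patches, mask):
    """Keep only the patches whose mask entry is 1.

    patches: 2D grid (list of rows) of patch payloads.
    mask:    grid of the same shape holding 0 (hidden) or 1 (visible).
    Returns the visible patches as a flat list, plus the (row, col)
    position index of each one so spatial layout is not lost.
    """
    tokens, positions = [], []
    for r, (patch_row, mask_row) in enumerate(zip(patches, mask)):
        for c, (patch, visible) in enumerate(zip(patch_row, mask_row)):
            if visible:
                tokens.append(patch)
                positions.append((r, c))
    return tokens, positions


def attention_pairs(num_tokens):
    """Number of query-key pairs in one full self-attention layer."""
    return num_tokens * num_tokens


# Hypothetical 4x4 grid of patches; strings stand in for patch features.
grid = [[f"p{r}{c}" for c in range(4)] for r in range(4)]
# Hypothetical semantic mask: only the central 2x2 object is visible.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

tokens, positions = select_visible_patches(grid, mask)
print(len(tokens), positions)        # 4 [(1, 1), (1, 2), (2, 1), (2, 2)]
print(attention_pairs(16))           # 256 pairs if masked patches are only zeroed
print(attention_pairs(len(tokens)))  # 16 pairs if masked patches are dropped
```

In this toy case a quarter of the patches are visible, so the attention cost falls by a factor of 16. The position indices returned alongside the tokens are what a position-indexed attention layer would need to recover spatial relationships once the 2D grid is gone.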

Paper Structure

This paper contains 17 sections, 4 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: Differences between EMIC and GIT-SSIC in Semantic Masking
  • Figure 2: The framework of the masked image compression network. Q denotes quantization, AE denotes arithmetic encoding, AD denotes arithmetic decoding, and $L_i$ represents the number of EMIC blocks in the i-th stage (in this paper, $L_1$ to $L_4$ are 2, 2, 6, and 2, respectively). The upper part is the encoding process, while the lower part is the decoding process.
  • Figure 3: Illustration of block merging in the normal case (upper part) and the masked image case (lower part).
  • Figure 4: The attention unit and mask unit.
  • Figure 5: Sample the current Manhattan distance decay matrix based on the position indices of the visible blocks.
  • ...and 5 more figures
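The idea in the Figure 5 caption, sampling a Manhattan distance decay matrix from the position indices of the visible blocks, can be sketched as follows. This page does not give the paper's exact formulation, so the decay form `gamma ** distance`, the parameter `gamma`, and both function names are assumptions. The second function shows one possible reading of "decomposed": because the Manhattan distance is a sum of a row offset and a column offset, the decay factors into a row term and a column term, each of which can be looked up in a small per-axis table.

```python
def manhattan_decay(positions, gamma):
    """Decay matrix over visible blocks: gamma ** (|dr| + |dc|).

    positions: list of (row, col) indices of the visible blocks.
    gamma:     decay base in (0, 1); an assumed hyperparameter.
    """
    return [
        [gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in positions]
        for (r1, c1) in positions
    ]


def decomposed_decay(positions, gamma, height, width):
    """Same matrix, built from two small per-axis lookup tables.

    Since gamma ** (|dr| + |dc|) == gamma ** |dr| * gamma ** |dc|,
    one table per axis is enough, and the entry for each pair of
    visible blocks is sampled by indexing with their positions.
    """
    row_table = [[gamma ** abs(i - j) for j in range(height)] for i in range(height)]
    col_table = [[gamma ** abs(i - j) for j in range(width)] for i in range(width)]
    return [
        [row_table[r1][r2] * col_table[c1][c2] for (r2, c2) in positions]
        for (r1, c1) in positions
    ]


# Hypothetical example: three visible blocks on a 3x4 grid of blocks.
positions = [(0, 0), (0, 3), (2, 1)]
direct = manhattan_decay(positions, 0.5)
factored = decomposed_decay(positions, 0.5, height=3, width=4)

print(direct[0][1])        # 0.125  (Manhattan distance 3)
print(direct[1][2])        # 0.0625 (Manhattan distance 4)
print(direct == factored)  # True
```

The per-axis tables depend only on the grid size, not on the mask, so under these assumptions they could be computed once and reused for every image, with only the indexing step changing per mask.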