LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression

Shimon Murai, Heming Sun, Jiro Katto

TL;DR

This paper demonstrates that using a large multi-modal model (LMM), it is possible to generate captions and compress them within a single model, and proposes a novel semantic-perceptual-oriented fine-tuning method applicable to any LIC network, resulting in a 41.58% improvement in LPIPS BD-rate compared to existing methods.

Abstract

Supported by powerful generative models, low-bitrate learned image compression (LIC) models that optimize perceptual metrics have become feasible. Some of the most advanced models achieve high compression rates and superior perceptual quality by using image captions as side information. This paper demonstrates that a large multi-modal model (LMM) can both generate captions and compress them within a single model. We also propose a novel semantic-perceptual-oriented fine-tuning method applicable to any LIC network, resulting in a 41.58% improvement in LPIPS BD-rate compared to existing methods. Our implementation and pre-trained weights are available at https://github.com/tokkiwa/ImageTextCoding.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Compression results for kodim15.png from the Kodak dataset, with a zoomed-in view. Our model achieves ultra-low-bitrate compression while avoiding the color distortion seen in MISC.
  • Figure 2: Our network architecture. The image is compressed into an image bitstream by the LIC model (upper path) and, at the same time, passed to the LMM encoder (lower path) to generate a caption. The generated caption is then encoded into a text bitstream. The two bitstreams are then decompressed and fed to the diffusion model.
  • Figure 3: The visualization of LMM text compression.
  • Figure 4: Relationship between bpp and LPIPS.
  • Figure 5: Relationship between bpp and CLIP Similarity.
  • ...and 1 more figure
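
Figures 4 and 5 plot quality against bpp (bits per pixel). Since the architecture in Figure 2 produces two bitstreams, the following is a minimal sketch of how a combined rate could be computed, assuming both the image and the text bitstream count toward the total. The function name `total_bpp` and the payload sizes are hypothetical and not taken from the paper; only the 768x512 resolution of Kodak images is a known property of the dataset.

```python
def total_bpp(image_stream: bytes, text_stream: bytes, width: int, height: int) -> float:
    """Bits per pixel when both bitstreams count toward the rate (an assumption)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Each byte contributes 8 bits; the rate is normalized by the pixel count.
    total_bits = 8 * (len(image_stream) + len(text_stream))
    return total_bits / (width * height)


# Hypothetical payload sizes, chosen only for illustration.
image_stream = bytes(1200)  # stand-in for the LIC image bitstream
text_stream = bytes(36)     # stand-in for the compressed caption

rate = total_bpp(image_stream, text_stream, width=768, height=512)
print(f"{rate:.4f} bpp")
```

At ultra-low bitrates the caption can account for a noticeable share of the total, which is why the sketch keeps the two payloads as separate arguments rather than one merged buffer.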