MISC: Ultra-low Bitrate Image Semantic Compression Driven by Large Multimodal Model

Chunyi Li, Guo Lu, Donghui Feng, Haoning Wu, Zicheng Zhang, Xiaohong Liu, Guangtao Zhai, Weisi Lin, Wenjun Zhang

TL;DR

This work tackles ultra-low bitrate image compression by separating semantic content from pixel-level detail. It introduces MISC, a framework that uses a Large Multimodal Model (LMM) to extract semantic information, annotates each item with its name, details, and a position map, and reconstructs the image with a diffusion-based decoder guided by these semantic constraints, achieving high consistency and perceptual quality at around 0.02–0.05 bpp. The authors also construct AIGI-SCD, a high-quality dataset of AI-Generated Images (AIGIs), to evaluate compression on both Natural Sense Images (NSIs) and AIGIs. Experiments on CLIC2020 and AIGI-SCD demonstrate state-of-the-art performance with dynamic bitrate adjustment and strong robustness to content type. The LMM-driven approach points to a practical, scalable direction for storage and communication systems in the era of AI-generated content.

Abstract

With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a topic in high demand. However, existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrates. In recent years, the rapid development of Large Multimodal Models (LMMs) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder that extracts the semantic information of the image, a map encoder that locates the region corresponding to each semantic element, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image from the above information. Experimental results show that the proposed MISC is suitable for compressing both traditional Natural Sense Images (NSIs) and emerging AI-Generated Images (AIGIs). It can achieve optimal consistency and perception results while saving 50% of the bitrate, which gives it strong potential for applications in the next generation of storage and communication. The code will be released at https://github.com/lcysyzxdxc/MISC.
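To make the bitrate figures concrete, the following minimal sketch shows how a bits-per-pixel (bpp) value could be computed for a payload that combines text, position maps, and an image bitstream. It is an illustration only, not the authors' implementation: the `SemanticPayload` class, its field names, the example sizes, and the assumption that text is stored as plain UTF-8 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticPayload:
    """Hypothetical container for the compressed content of one image."""

    description: str                                      # text describing the whole image
    items: dict[str, str] = field(default_factory=dict)   # item name -> detail text
    map_bytes: int = 0                                    # size of the encoded position maps
    image_bytes: int = 0                                  # size of the compressed image bitstream

    def total_bytes(self) -> int:
        # Assumption: text is stored as plain UTF-8 without entropy coding.
        text = self.description + "".join(k + v for k, v in self.items.items())
        return len(text.encode("utf-8")) + self.map_bytes + self.image_bytes


def bits_per_pixel(payload_bytes: int, width: int, height: int) -> float:
    """Return the compressed size in bits divided by the number of pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return payload_bytes * 8 / (width * height)


def byte_budget(target_bpp: float, width: int, height: int) -> int:
    """Return the whole number of bytes available at a target bpp."""
    return int(target_bpp * width * height // 8)


if __name__ == "__main__":
    payload = SemanticPayload(
        description="a bike on grass",
        items={"bike": "red"},
        map_bytes=100,     # hypothetical size
        image_bytes=700,   # hypothetical size
    )
    print(payload.total_bytes())
    print(bits_per_pixel(payload.total_bytes(), 512, 512))
    print(byte_budget(0.024, 512, 512))
```

Under these assumptions, a 512×512 image at 0.024 bpp leaves a budget of under 800 bytes for everything that is transmitted, which shows how little room the text, the maps, and the image bitstream have to share.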

Paper Structure (20 sections, 10 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: The framework of the MISC model, including LMM/map/image encoders and an LMM decoder. The compressed content (marked in green) comprises an extremely compressed image bitstream below 0.024 bpp, a detailed description of the whole image, and each item's name, details, and estimated position map. The decoder controls the diffusion process according to this content to generate images that satisfy both high consistency and high perceptual quality.
  • Figure 2: Comparison of mapping spatial domain into frequency or semantic domain. Both methods compress images by retaining important information and discarding other information.
  • Figure 3: The positional map of three items, and the decompressed image with/without maps. When (b) and (c) are not used as constraints, 'wooden' and 'grass' will affect the 'bike' region respectively.
  • Figure 4: The normalized probability distributions of the low-level attributes. The distributions cover NSIs from the Kodak24, CLIC2020, and Tecnick databases, and AIGIs from the proposed AIGI-SCD database. The AIGIs have a sharper distribution, and blur is more common in them.
  • Figure 5: Quality score comparison of AIGI databases. All existing AIGI databases have flaws in at least one quality indicator, while all six quality scores of AIGI-SCD are satisfactory, making it suitable for image compression tasks.
  • ...and 5 more figures