MISC: Ultra-low Bitrate Image Semantic Compression Driven by Large Multimodal Model
Chunyi Li, Guo Lu, Donghui Feng, Haoning Wu, Zicheng Zhang, Xiaohong Liu, Guangtao Zhai, Weisi Lin, Wenjun Zhang
TL;DR
This work tackles ultra-low bitrate image compression by separating semantic content from pixel-level detail. It introduces MISC, a framework that uses Large Multimodal Models (LMMs) to extract semantic information, annotates spatial regions with Name-Detail-Map annotations, and reconstructs images with a diffusion-based decoder guided by semantic constraints, achieving high consistency and perceptual quality at around 0.02–0.05 bpp. The authors also construct AIGI-SCD, a high-quality dataset of AI-generated images (AIGIs), so that compression can be evaluated on both natural scene images (NSIs) and AIGIs. Experiments on CLIC2020 and AIGI-SCD demonstrate state-of-the-art performance, with dynamic bitrate adjustment and strong robustness to content type. The LMM-driven approach points to a practical, scalable direction for storage and communication systems in the era of AI-generated content.
Abstract
With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a topic in high demand. However, at ultra-low bitrates, existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality. In recent years, the rapid development of Large Multimodal Models (LMMs) has made it possible to balance these two goals. To this end, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder that extracts the semantic information of the image, a map encoder that locates the region corresponding to each semantic element, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image from the above information. Experimental results show that the proposed MISC is suitable for compressing both traditional Natural Scene Images (NSIs) and emerging AI-Generated Images (AIGIs). It achieves optimal consistency and perceptual quality while saving 50% of the bitrate, which gives it strong potential for applications in the next generation of storage and communication. The code will be released at https://github.com/lcysyzxdxc/MISC.
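To give a sense of scale for the bitrates involved, the following is a minimal sketch of how bits per pixel (bpp) could be computed for a payload made of the three encoder outputs named above. The `MiscPayload` container, its field names, and the helper functions are hypothetical illustrations, not part of the released MISC code; the only assumption is that the transmitted data is the concatenation of the text, map, and image bitstreams.

```python
from dataclasses import dataclass


@dataclass
class MiscPayload:
    """Hypothetical container for the three encoder outputs (names are assumptions)."""
    text_bytes: bytes   # semantic description from the LMM encoder
    map_bytes: bytes    # region map from the map encoder
    image_bytes: bytes  # extremely compressed bitstream from the image encoder

    def total_bits(self) -> int:
        # Total size of everything that would be stored or transmitted, in bits.
        return 8 * (len(self.text_bytes) + len(self.map_bytes) + len(self.image_bytes))


def bits_per_pixel(payload: MiscPayload, width: int, height: int) -> float:
    """Bitrate in bits per pixel: total payload bits divided by the pixel count."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return payload.total_bits() / (width * height)


def byte_budget(bpp: float, width: int, height: int) -> int:
    """Largest whole number of bytes that fits a target bpp for a given image size."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return int(bpp * width * height // 8)


if __name__ == "__main__":
    # Placeholder byte strings standing in for real encoder outputs.
    payload = MiscPayload(text_bytes=b"t" * 100, map_bytes=b"m" * 156, image_bytes=b"i" * 400)
    print(bits_per_pixel(payload, 512, 512))
    print(byte_budget(0.02, 512, 512), byte_budget(0.05, 512, 512))
```

Under these assumptions, a 512×512 image at 0.02 bpp leaves a budget of about 655 bytes for all three components combined, and about 1638 bytes at 0.05 bpp, which illustrates why most of the image content has to be carried by semantics rather than pixel-level data.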
