UniMIC: Towards Universal Multi-modality Perceptual Image Compression

Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Zongyu Guo, Yiting Lu, Yulin Ren, Zhibo Chen

TL;DR

UniMIC addresses the challenge of achieving universal perceptual-quality image compression across diverse codecs by leveraging cross-modality priors. The framework builds a compositional visual codec repository, introduces multi-grained textual coding with Content Prompt and Compression Prompt, and deploys a universal perceptual compensator powered by Stable Diffusion to deliver text-assisted, diffusion-guided reconstructions. A two-stage training scheme, with a lightweight, trainable diffusion component and a decoder refinement module, enables unified optimization of rate, distortion, and perceptual quality across both traditional and neural codecs, including ultra-low bitrate scenarios. Empirical results demonstrate broad perceptual gains (lower LPIPS and FID) across multiple codecs, with extensibility to unseen codecs and flexible distortion-perception trade-offs, making UniMIC practical for diverse deployment contexts. The work advances perceptual image compression by unifying cross-modality priors with a versatile, codec-agnostic reconstruction framework, potentially enabling improved user experiences at variable bitrates in real-world systems.

Abstract

We present UniMIC, a universal multi-modality image compression framework that aims to unify rate-distortion-perception (RDP) optimization for multiple image codecs simultaneously by exploiting cross-modality generative priors. Unlike most existing works, which need to design and optimize image codecs from scratch, UniMIC introduces a visual codec repository, which incorporates a number of representative image codecs and directly uses them as the basic codecs for various practical applications. Moreover, we propose multi-grained textual coding, in which a variable-length content prompt and a compression prompt are designed and encoded to assist perceptual reconstruction through multi-modality conditional generation. In particular, a universal perception compensator is proposed to improve the perceptual quality of decoded images from all basic codecs at the decoder side by reusing text-assisted diffusion priors from Stable Diffusion. With these three strategies working together, UniMIC achieves a significant improvement in RDP optimization for different compression codecs, e.g., traditional and learnable codecs, and different compression costs, e.g., ultra-low bitrates. The code will be available at https://github.com/Amygyx/UniMIC .
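Because the content prompt and compression prompt are encoded and transmitted alongside the basic codec's bitstream, they add to the total rate. The sketch below illustrates one way such bitrate accounting could be done. It is not taken from the paper: the function names `bpp` and `total_bpp` are hypothetical, and `zlib` is only a stand-in for whatever lossless text coder the authors actually use.

```python
import zlib


def bpp(num_bytes: int, width: int, height: int) -> float:
    """Bits per pixel for a payload of num_bytes on a width x height image."""
    return 8 * num_bytes / (width * height)


def total_bpp(image_bitstream: bytes, content_prompt: str,
              compression_prompt: str, width: int, height: int) -> float:
    """Rate of the image bitstream plus the losslessly coded text prompts.

    Assumption: the two prompts are joined and compressed with zlib; the
    paper's actual text coding scheme may differ.
    """
    text_payload = (content_prompt + "\n" + compression_prompt).encode("utf-8")
    text_bytes = len(zlib.compress(text_payload, 9))
    return bpp(len(image_bitstream) + text_bytes, width, height)


if __name__ == "__main__":
    # Hypothetical 384-byte bitstream for a 768x512 (Kodak-sized) image.
    fake_bitstream = bytes(384)
    print(bpp(len(fake_bitstream), 768, 512))
    print(total_bpp(fake_bitstream, "a lighthouse by the sea",
                    "codec: jpeg, quality: low", 768, 512))
```

At ultra-low bitrates the text overhead is a non-negligible share of the budget, which is one plausible motivation for making the content prompt variable-length.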

Paper Structure

This paper contains 20 sections, 3 equations, 38 figures, 4 tables.

Figures (38)

  • Figure 1: Visual comparisons of our proposed UniMIC framework with eight representative basic codecs, including the hand-crafted codecs VTM (VVC), HM (HEVC), and JPEG [wallace1992jpeg], the MSE-optimized neural codecs ELIC [he2022elic], cheng20-mse [cheng2020learned], and mbt2018 [minnen2018joint], and the MS-SSIM-optimized cheng20-msssim, on the Kodak dataset. Our method achieves more realistic and clearer reconstructions than all basic codecs.
  • Figure 2: Illustration of our proposed UniMIC. The visual codec repository includes various representative basic codecs and provides visual representations. Multi-grained textual information, composed of a variable-length content prompt and a compression prompt, is losslessly transmitted and processed by the CLIP text encoder. Finally, the universal perceptual compensator takes the multi-modality information as input to conduct the diffusion process.
  • Figure 3: Overall performance comparison between our method and state-of-the-art codecs on DIV2K.
  • Figure 4:
  • Figure 5: 0.0073bpp
  • ...and 33 more figures